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The Web site at WilliamStallings.com/COA/COA8e.html provides support for instructors 
and students using the book. It includes the following elements. 


= 
Course Support Materials 
es 


¢ A set of PowerPoint slides for use as lecture aids. 

¢ Copies of figures from the book in PDF format. 

¢ Copies of tables from the book in PDF format. 

¢ Computer Science Student Resource Site: contains a number of links and docu- 
ments that students may find useful in their ongoing computer science education. 
The site includes a review of basic, relevant mathematics; advice on research, 
writing, and doing homework problems; links to computer science research 
resources, such as report repositories and bibliographies; and other useful links. 

e An errata sheet for the book, updated at most monthly. 


| Supplemental Documents 
— e A set of supplemental homework problems with solutions. Students can en- 
hance their understanding of the material by working out the solutions to 
these problems and then checking their answers. 
e Three online chapters: number systems, digital logic, and IA-64 architecture 
e Nine online appendices that expand on the treatment in the book. Topics 
include recursion, and various topics related to memory. 
e All of the Intel x86 and ARM architecture material from the book reproduced 
in two PDF documents for easy reference. 
¢ Other useful documents 


COA Courses 


The Web site includes links to Web sites for courses taught using the book. These 
sites can provide useful ideas about scheduling and topic ordering, as well as a num- 
ber of useful handouts and other materials. 


@ Useful Web Sites 


The Web site includes links to relevant Web sites. The links cover a broad spectrum 
of topics and will enable students to explore timely issues in greater depth. 


= Internet Mailing List 
An Internet mailing list is maintained so that instructors using this book can ex- 
change information, suggestions, and questions with each other and the author. Sub- 
scription information is provided at the book’s Web site. 


g, Simulation Tools for COA Projects 
The Web site includes a number of interactive simulation tools, which are keyed to the 
topics of the book. The Web site also includes links to the SimpleScalar and SMPCache 
web sites. These are two software packages that serve as frameworks for project imple- 
mentation. Each site includes downloadable software and background information. 
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PREFACE 


OBJECTIVES 


This book is about the structure and function of computers. Its purpose is to present, as 
clearly and completely as possible, the nature and characteristics of modern-day computer 
systems. 

This task is challenging for several reasons. First, there is a tremendous variety of prod- 
ucts that can rightly claim the name of computer, from single-chip microprocessors costing a 
few dollars to supercomputers costing tens of millions of dollars. Variety is exhibited not 
only in cost, but also in size, performance, and application. Second, the rapid pace of change 
that has always characterized computer technology continues with no letup. These changes 
cover all aspects of computer technology, from the underlying integrated circuit technology 
used to construct computer components, to the increasing use of parallel organization con- 
cepts in combining those components. 

In spite of the variety and pace of change in the computer field, certain fundamental 
concepts apply consistently throughout. The application of these concepts depends on the 
current state of the technology and the price/performance objectives of the designer. The in- 
tent of this book is to provide a thorough discussion of the fundamentals of computer orga- 
nization and architecture and to relate these to contemporary design issues. 

The subtitle suggests the theme and the approach taken in this book. It has always 
been important to design computer systems to achieve high performance, but never has this 
requirement been stronger or more difficult to satisfy than today. All of the basic perfor- 
mance characteristics of computer systems, including processor speed, memory speed, mem- 
ory capacity, and interconnection data rates, are increasing rapidly. Moreover, they are 
increasing at different rates. This makes it difficult to design a balanced system that maxi- 
mizes the performance and utilization of all elements. Thus, computer design increasingly 
becomes a game of changing the structure or function in one area to compensate for a per- 
formance mismatch in another area. We will see this game played out in numerous design 
decisions throughout the book. 

A computer system, like any system, consists of an interrelated set of components. The 
system is best characterized in terms of structure —the way in which components are intercon- 
nected, and function—the operation of the individual components. Furthermore, a computer’s 
organization is hierarchical. Each major component can be further described by decomposing it 
into its major subcomponents and describing their structure and function. For clarity and ease 
of understanding, this hierarchical organization is described in this book from the top down: 


¢ Computer system: Major components are processor, memory, I/O. 

¢ Processor: Major components are control unit, registers, ALU, and instruction 
execution unit. 

¢ Control Unit: Provides control signals for the operation and coordination of all 
processor components. Traditionally, a microprogramming implementation has been 
used, in which major components are control memory, microinstruction sequencing 
logic, and registers. More recently, microprogramming has been less prominent but 
remains an important implementation technique. 
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The objective is to present the material in a fashion that keeps new material in a clear 
context. This should minimize the chance that the reader will get lost and should provide 
better motivation than a bottom-up approach. 

Throughout the discussion, aspects of the system are viewed from the points of view of 
both architecture (those attributes of a system visible to a machine language programmer) and 
organization (the operational units and their interconnections that realize the architecture). 


EXAMPLE SYSTEMS 


This text is intended to acquaint the reader with the design principles and implementation is- 
sues of contemporary operating systems. Accordingly, a purely conceptual or theoretical 
treatment would be inadequate. To illustrate the concepts and to tie them to real-world design 
choices that must be made, two processor families have been chosen as running examples: 


¢ Intel x86 architecture: The x86 architecture is the most widely used for non-embedded 
computer systems. The x86 is essentially a complex instruction set computer (CISC) with 
some RISC features. Recent members of the x86 family make use of superscalar and mul- 
ticore design principles. The evolution of features in the x86 architecture provides a unique 
case study of the evolution of most of the design principles in computer architecture. 

e ARM: The ARM embedded architecture is arguably the most widely used embedded 
processor, used in cell phones, iPods, remote sensor equipment, and many other de- 
vices. The ARM is essentially a reduced instruction set computer (RISC). Recent 
members of the ARM family make use of superscalar and multicore design principles. 


Many, but by no means all, of the examples are drawn from these two computer families: the 


Intel x86, and the ARM embedded processor family. Numerous other systems, both contempo- 
rary and historical, provide examples of important computer architecture design features. 


PLAN OF THE TEXT 


The book is organized into five parts (see Chapter 0 for an overview) 


¢ Overview 

¢ The computer system 

e The central processing unit 

e The control unit 

¢ Parallel organization, including multicore 

The book includes a number of pedagogic features, including the use of interactive sim- 

ulations and numerous figures and tables to clarify the discussion. Each chapter includes a 
list of key words, review questions, homework problems, suggestions for further reading, and 
recommended Web sites. The book also includes an extensive glossary, a list of frequently 
used acronyms, and a bibliography. 


INTENDED AUDIENCE 


The book is intended for both an academic and a professional audience. As a textbook, it is in- 
tended as a one- or two-semester undergraduate course for computer science, computer engi- 
neering, and electrical engineering majors. It covers all the topics in CS 220 Computer 
Architecture, which is one of the core subject areas in the IEEE/ACM Computer Curricula 2001. 
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For the professional interested in this field, the book serves as a basic reference volume 
and is suitable for self-study. 


INSTRUCTIONAL SUPPORT MATERIALS 


To support instructors, the following materials are provided: 


e Solutions manual: Solutions to end-of-chapter Review Questions and Problems 
¢ Projects manual: Suggested project assignments for all of the project categories 
listed below 
¢ PowerPoint slides: A set of slides covering all chapters, suitable for use in lecturing 
¢ PDF files: Reproductions of all figures and tables from the book 
e Test bank: Includes true/false, multiple choice, and fill-in-the-blanks questions 
and answers 
All of these support materials are available at the Instructor Resource Center (IRC) 
for this textbook. To gain access to the IRC, please contact your local Prentice Hall sales rep- 
resentative via prenhall.com/replocator or call Prentice Hall Faculty Services at 1-800-526- 
0485. You can also locate the IRC through http://www.pearsonhighered.com/stallings. 


INTERNET SERVICES FOR INSTRUCTORS AND STUDENTS 


There is a Web site for this book that provides support for students and instructors. The site 
includes links to other relevant sites and a set of useful documents. See the section, “Web 
Site for Computer Organization and Architecture,” preceding this Preface, for more infor- 
mation. The Web page is at williamstallings.com/COA/COA8e.html. 

New to this edition is a set of homework problems with solutions publicly available at 
this Web site. Students can enhance their understanding of the material by working out the 
solutions to these problems and then checking their answers. 

An Internet mailing list has been set up so that instructors using this book can ex- 
change information, suggestions, and questions with each other and with the author. As soon 
as typos or other errors are discovered, an errata list for this book will be available at 
WilliamStallings.com. Finally, I maintain the Computer Science Student Resource Site at 
WilliamStallings.com/StudentSupport.html. 


PROJECTS AND OTHER STUDENT EXERCISES 


For many instructors, an important component of a computer organization and architecture 
course is a project or set of projects by which the student gets hands-on experience to rein- 
force concepts from the text. This book provides an unparalleled degree of support for in- 
cluding a projects component in the course. The instructor’s support materials available 
through Prentice Hall not only includes guidance on how to assign and structure the projects 
but also includes a set of user’s manuals for various project types plus specific assignments, 
all written especially for this book. Instructors can assign work in the following areas: 


¢ Interactive simulation assignments: Described subsequently. 


¢ Research projects: A series of research assignments that instruct the student to re- 
search a particular topic on the Internet and write a report. 
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e Simulation projects: The IRC provides support for the use of the two simulation pack- 
ages: SimpleScalar can be used to explore computer organization and architecture 
design issues. SMPCache provides a powerful educational tool for examining cache 
design issues for symmetric multiprocessors. 

e Assembly language projects: A simplified assembly language, CodeBlue, is used and 
assignments based on the popular Core Wars concept are provided. 

¢ Reading/report assignments: A list of papers in the literature, one or more for each 
chapter, that can be assigned for the student to read and then write a short report. 

¢ Writing assignments: A list of writing assignments to facilitate learning the material. 

¢ Test bank: Includes T/F, multiple choice, and fill-in-the-blanks questions and answers. 


This diverse set of projects and other student exercises enables the instructor to use the 
book as one component in a rich and varied learning experience and to tailor a course plan to 
meet the specific needs of the instructor and students. See Appendix A in this book for details. 


INTERACTIVE SIMULATIONS 


New to this edition is the incorporation of interactive simulations. These simulations provide a 
powerful tool for understanding the complex design features of a modern computer system. A 
total of 20 interactive simulations are used to illustrate key functions and algorithms in com- 
puter organization and architecture design. At the relevant point in the book, an icon indicates 
that a relevant interactive simulation is available online for student use. Because the animations 
enable the user to set initial conditions, they can serve as the basis for student assignments. The 
instructor’s supplement includes a set of assignments, one for each of the animations. Each 
assignment includes a several specific problems that can be assigned to students. 


WHAT?’S NEW IN THE EIGHTH EDITION 


In the four years since the seventh edition of this book was published, the field has seen 
continued innovations and improvements. In this new edition, I try to capture these 
changes while maintaining a broad and comprehensive coverage of the entire field. To 
begin this process of revision, the seventh edition of this book was extensively reviewed by 
a number of professors who teach the subject and by professionals working in the field. 
The result is that, in many places, the narrative has been clarified and tightened, and illus- 
trations have been improved. Also, a number of new “field-tested” homework problems 
have been added. 

Beyond these refinements to improve pedagogy and user friendliness, there have been 
substantive changes throughout the book. Roughly the same chapter organization has been 
retained, but much of the material has been revised and new material has been added. The 
most noteworthy changes are as follows: 


e Interactive simulation: Simulation provides a powerful tool for understanding the 
complex mechanisms of a modern processor. The eighth edition incorporates 20 sepa- 
rate interactive, Web-based simulation tools covering such areas as cache memory, 
main memory, I/O, branch prediction, instruction pipelining, and vector processing. At 
appropriate places in the book, the simulators are highlighted so that the student can 
invoke the simulation at the proper point in studying the book. 
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¢ Embedded processors: The eighth edition now includes coverage of embedded proces- 
sors and the unique design issues they present. The ARM architecture is used as a 

case study. 

Multicore processors: The eighth edition now includes coverage of what has become 
the most prevalent new development in computer architecture: the use of multiple 
processors on a single chip. Chapter 18 is devoted to this topic. 

Cache memory: Chapter 4, which is devoted to cache memory, has been extensively 
revised, updated, and expanded to provide broader technical coverage and im- 
proved pedagogy through the use of numerous figures, as well as interactive simula- 
tion tools. 

Performance assessment: Chapter 2 includes a significantly expanded discussion of 
performance assessment, including a new discussion of benchmarks and an analysis of 
Amdahl’s law. 

Assembly language: A new appendix has been added that covers assembly language 
and assemblers. 

Programmable logic devices: The discussion of PLDs in Chapter 20 on digital logic has 
been expanded to include an introduction to field-programmable gate arrays 
(FPGAs). 

DDR SDRAM: DDR has become the dominant main memory technology in desktops 
and servers, particularly DDR2 and DDR3. DDR technology is covered in Chapter 5, 
with additional details in Appendix K. 

Linear tape open (LTO): LTO has become the best selling “super tape” format and is 
widely used with small and large computer systems, especially for backup, LTO is cov- 
ered in Chapter 6, with additional details in Appendix J. 


With each new edition it is a struggle to maintain a reasonable page count while adding 
new material. In part this objective is realized by eliminating obsolete material and tighten- 
ing the narrative. For this edition, chapters and appendices that are of less general interest 
have been moved online, as individual PDF files. This has allowed an expansion of material 
without the corresponding increase in size and price. 
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This book, with its accompanying Web site, covers a lot of material. In this chapter, we 
give the reader an overview. 


0.1 OUTLINE OF THE BOOK 


The book is organized into five parts: 


Part One: Provides an overview of computer organization and architecture 
and looks at how computer design has evolved. 


Part Two: Examines the major components of a computer and their intercon- 
nections, both with each other and the outside world. This part also includes a 
detailed discussion of internal and external memory and of input-output 
(I/O). Finally, the relationship between a computer’s architecture and the op- 
erating system running on that architecture is examined. 


Part Three: Examines the internal architecture and organization of the proces- 
sor. This part begins with an extended discussion of computer arithmetic. Then 
it looks at the instruction set architecture. The remainder of the part deals with 
the structure and function of the processor, including a discussion of reduced 
instruction set computer (RISC) and superscalar approaches. 


Part Four: Discusses the internal structure of the processor’s control unit and 
the use of microprogramming. 


Part Five: Deals with parallel organization, including symmetric multiprocess- 
ing, clusters, and multicore architecture. 


A number of online chapters and appendices at this book’s Web site cover 
additional topics relevant to the book. 

A more detailed, chapter-by-chapter summary of each part appears at the 
beginning of that part. 

This text is intended to acquaint you with the design principles and implemen- 
tation issues of contemporary computer organization and architecture. Accordingly, 
a purely conceptual or theoretical treatment would be inadequate. This book uses 
examples from a number of different machines to clarify and reinforce the concepts 
being presented. Many, but by no means all, of the examples are drawn from two 
computer families: the Intel x86 family and the ARM (Advanced RISC Machine) 
family. These two systems together encompass most of the current computer design 
trends. The Intel x86 architecture is essentially a complex instruction set computer 
(CISC) with some RISC features, while the ARM is essentially a RISC. Both sys- 
tems make use of superscalar design principles and both support multiple processor 
and multicore configurations. 


0.2 A ROADMAP FOR READERS AND INSTRUCTORS 


This book follows a top-down approach to the presentation of the material. As we 
discuss in more detail in Section 1.2, a computer system can be viewed as a hierar- 
chical structure. At a top level, we are concerned with the major components of 
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the computers: processor, I/O, memory, peripheral devices. Part Two examines 
these components and looks in some detail at each component except the proces- 
sor. This approach allows us to see the external functional requirements that drive 
the processor design, setting the stage for Part Three. In Part Three, we examine 
the processor in great detail. Because we have the context provided by Part Two, 
we are able, in Part Three, to see the design decisions that must be made so that 
the processor supports the overall function of the computer system. Next, in Part 
Four, we look at the control unit, which is at the heart of the processor. Again, the 
design of the control unit can best be explained in the context of the function it 
performs within the context of the processor. Finally, Part Five examines systems 
with multiple processors, including clusters, multiprocessor computers, and multi- 
core computers. 


WHY STUDY COMPUTER ORGANIZATION 


AND ARCHITECTURE? 


The IEEE/ACM Computer Curricula 2001, prepared by the Joint Task Force on 
Computing Curricula of the IEEE (Institute of Electrical and Electronics Engineers) 
Computer Society and ACM (Association for Computing Machinery), lists computer 
architecture as one of the core subjects that should be in the curriculum of all stu- 
dents in computer science and computer engineering. The report says the following: 


The computer lies at the heart of computing. Without it most of 
the computing disciplines today would be a branch of theoretical 
mathematics. To be a professional in any field of computing today, 
one should not regard the computer as just a black box that exe- 
cutes programs by magic. All students of computing should acquire 
some understanding and appreciation of a computer system’s func- 
tional components, their characteristics, their performance, and 
their interactions. There are practical implications as well. Students 
need to understand computer architecture in order to structure a 
program so that it runs more efficiently on a real machine. In se- 
lecting a system to use, they should be able to understand the 
tradeoff among various components, such as CPU clock speed vs. 
memory size. 


A more recent publication of the task force, Computer Engineering 2004 
Curriculum Guidelines, emphasized the importance of Computer Architecture and 
Organization as follows: 


Computer architecture is a key component of computer engineer- 
ing and the practicing computer engineer should have a practical 
understanding of this topic. It is concerned with all aspects of the 
design and organization of the central processing unit and the inte- 
gration of the CPU into the computer system itself. Architecture 
extends upward into computer software because a processor’s 
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architecture must cooperate with the operating system and system 
software. It is difficult to design an operating system well without 
knowledge of the underlying architecture. Moreover, the computer 
designer must have an understanding of software in order to imple- 
ment the optimum architecture. 

The computer architecture curriculum has to achieve multi- 
ple objectives. It must provide an overview of computer architec- 
ture and teach students the operation of a typical computing 
machine. It must cover basic principles, while acknowledging the 
complexity of existing commercial systems. Ideally, it should rein- 
force topics that are common to other areas of computer engineer- 
ing; for example, teaching register indirect addressing reinforces 
the concept of pointers in C. Finally, students must understand how 
various peripheral devices interact with, and how they are inter- 
faced to a CPU. 


[CLEM00] gives the following examples as reasons for studying computer 
architecture: 


1. Suppose a graduate enters the industry and is asked to select the most cost- 
effective computer for use throughout a large organization. An understanding 
of the implications of spending more for various alternatives, such as a larger 
cache or a higher processor clock rate, is essential to making the decision. 


2. Many processors are not used in PCs or servers but in embedded systems. A de- 
signer may program a processor in C that is embedded in some real-time or 
larger system, such as an intelligent automobile electronics controller. Debugging 
the system may require the use of a logic analyzer that displays the relationship 
between interrupt requests from engine sensors and machine-level code. 

3. Concepts used in computer architecture find application in other courses. In 
particular, the way in which the computer provides architectural support for 
programming languages and operating system facilities reinforces concepts 
from those areas. 


As can be seen by perusing the table of contents of this book, computer orga- 
nization and architecture encompasses a broad range of design issues and concepts. 
A good overall understanding of these concepts will be useful both in other areas of 
study and in future work after graduation. 


0.4 INTERNET AND WEB RESOURCES 


There are a number of resources available on the Internet and the Web that support 
this book and help readers keep up with developments in this field. 


Web Sites for This Book 


There is a Web page for this book at WilliamStallings.com/COA/COA8e.html. See the 
layout at the beginning of this book for a detailed description of that site. 
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An errata list for this book will be maintained at the Web site and updated as 
needed. Please e-mail any errors that you spot to me. Errata sheets for my other 
books are at WilliamStallings.com. 

I also maintain the Computer Science Student Resource Site, at WilliamStallings 
.com/StudentSupport.html. The purpose of this site is to provide documents, informa- 
tion, and links for computer science students and professionals. Links and docu- 
ments are organized into six categories: 


e Math: Includes a basic math refresher, a queuing analysis primer, a number 
system primer, and links to numerous math sites. 


e How-to: Advice and guidance for solving homework problems, writing techni- 
cal reports, and preparing technical presentations. 


e Research resources: Links to important collections of papers, technical re- 
ports, and bibliographies. 


e Miscellaneous: A variety of other useful documents and links. 


e Computer science careers: Useful links and documents for those considering a 
career in computer science. 


e Humor and other diversions: You have to take your mind off your work once 
in a while. 


Other Web Sites 


There are numerous Web sites that provide information related to the topics of this 
book. In subsequent chapters, lists of specific Web sites can be found in the 
Recommended Reading and Web Sites section. Because the addresses for Web sites 
tend to change frequently, the book does not provide URLs. For all of the Web sites 
listed in the book, the appropriate link can be found at this book’s Web site. Other 
links not mentioned in this book will be added to the Web site over time. 

The following are Web sites of general interest related to computer organiza- 
tion and architecture: 


¢ WWW Computer Architecture Home Page: A comprehensive index to infor- 
mation relevant to computer architecture researchers, including architecture 
groups and projects, technical organizations, literature, employment, and com- 
mercial information 


e CPU Info Center: Information on specific processors, including technical pa- 
pers, product information, and latest announcements 


e Processor Emporium: Interesting and useful collection of information 


e ACM Special Interest Group on Computer Architecture: Information on 
SIGARCH activities and publications 


e IEEE Technical Committee on Computer Architecture: Copies of TCAA 
newsletter 


USENET Newsgroups 


A number of USENET newsgroups are devoted to some aspect of computer orga- 
nization and architecture. As with virtually all USENET groups, there is a high 
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noise-to-signal ratio, but it is worth experimenting to see if any meet your needs. The 
most relevant are as follows: 


¢ comp.arch: A general newsgroup for discussion of computer architecture. 
Often quite good. 
¢ comp.arch.arithmetic: Discusses computer arithmetic algorithms and standards. 


° comp.arch.storage: Discussion ranges from products to technology to practical 
usage issues. 


© comp.parallel: Discusses parallel computers and applications. 
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PART ONE 


P.1 ISSUES FOR PART ONE 


The purpose of Part One is to provide a background and context for the remainder 
of this book. The fundamental concepts of computer organization and architecture 
are presented. 


ROAD MAP FOR PART ONE 


Chapter 1 Introduction 


Chapter 1 introduces the concept of the computer as a hierarchical system. 
A computer can be viewed as a structure of components and its function 
described in terms of the collective function of its cooperating components. 
Each component, in turn, can be described in terms of its internal structure 
and function. The major levels of this hierarchical view are introduced. The 
remainder of the book is organized, top down, using these levels. 


Chapter 2 Computer Evolution and Performance 


Chapter 2 serves two purposes. First, a discussion of the history of com- 
puter technology is an easy and interesting way of being introduced to the 
basic concepts of computer organization and architecture. The chapter 
also addresses the technology trends that have made performance the 
focus of computer system design and previews the various techniques and 
strategies that are used to achieve balanced, efficient performance. 


INTRODUCTION 


1.1 Organization and Architecture 


1.2 Structure and Function 


Function 
Structure 


1.3 Key Terms and Review Questions 
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This book is about the structure and function of computers. Its purpose is to present, as 
clearly and completely as possible, the nature and characteristics of modern-day com- 
puters. This task is a challenging one for two reasons. 

First, there is a tremendous variety of products, from single-chip microcomputers 
costing a few dollars to supercomputers costing tens of millions of dollars, that can 
rightly claim the name computer. Variety is exhibited not only in cost, but also in size, 
performance, and application. Second, the rapid pace of change that has always charac- 
terized computer technology continues with no letup. These changes cover all aspects 
of computer technology, from the underlying integrated circuit technology used to con- 
struct computer components to the increasing use of parallel organization concepts in 
combining those components. 

In spite of the variety and pace of change in the computer field, certain funda- 
mental concepts apply consistently throughout. To be sure, the application of these con- 
cepts depends on the current state of technology and the price/performance objectives 
of the designer. The intent of this book is to provide a thorough discussion of the funda- 
mentals of computer organization and architecture and to relate these to contemporary 
computer design issues. This chapter introduces the descriptive approach to be taken. 


1.1 ORGANIZATION AND ARCHITECTURE 


In describing computers, a distinction is often made between computer architecture and 
computer organization. Although it is difficult to give precise definitions for these 
terms, a consensus exists about the general areas covered by each (e.g., see [VRAN80], 
[SIEW82], and [BELL78a]); an interesting alternative view is presented in [REDD76]. 


Examples of 
architectural attributes include the instruction set, the number of bits used to repre- 
sent various data types (e.g., numbers, characters), I/O mechanisms, and techniques 
for addressing memory. Organizational attributes include those hardware details 
transparent to the programmer, such as control signals; interfaces between the com- 
puter and peripherals; and the memory technology used. 

For example, it is an architectural design issue whether a computer will have a 
multiply instruction. It is an organizational issue whether that instruction will be im- 
plemented by a special multiply unit or by a mechanism that makes repeated use of 
the add unit of the system. The organizational decision may be based on the antici- 
pated frequency of use of the multiply instruction, the relative speed of the two ap- 
proaches, and the cost and physical size of a special multiply unit. 

Historically, and still today, the distinction between architecture and organiza- 
tion has been an important one. Many computer manufacturers offer a family of 
computer models, all with the same architecture but with differences in organization. 
Consequently, the different models in the family have different price and perfor- 
mance characteristics. Furthermore, a particular architecture may span many years 
and encompass a number of different computer models, its organization changing 
with changing technology. A prominent example of both these phenomena is the 
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IBM System/370 architecture. This architecture was first introduced in 1970 and in- 
cluded a number of models. The customer with modest requirements could buy a 
cheaper, slower model and, if demand increased, later upgrade to a more expensive, 
faster model without having to abandon software that had already been developed. 
Over the years, IBM has introduced many new models with improved technology to 
replace older models, offering the customer greater speed, lower cost, or both. These 
newer models retained the same architecture so that the customer’s software invest- 
ment was protected. Remarkably, the System/370 architecture, with a few enhance- 
ments, has survived to this day as the architecture of IBM’s mainframe product line. 

In a class of computers called microcomputers, the relationship between archi- 
tecture and organization is very close. Changes in technology not only influence or- 
ganization but also result in the introduction of more powerful and more complex 
architectures. Generally, there is less of a requirement for generation-to-generation 
compatibility for these smaller machines. Thus, there is more interplay between or- 
ganizational and architectural design decisions. An intriguing example of this is the 
reduced instruction set computer (RISC), which we examine in Chapter 13. 

This book examines both computer organization and computer architecture. 
The emphasis is perhaps more on the side of organization. However, because a com- 
puter organization must be designed to implement a particular architectural specifi- 
cation, a thorough treatment of organization requires a detailed examination of 
architecture as well. 


1.2 STRUCTURE AND FUNCTION 


ow, then, can one clearly describe them? The key is to rec- 

ognize the hierarchical nature of most complex systems, including the computer 

[SIMO96]. A hierarchical system is a set of interrelated subsystems, each of the latter, in 

turn, hierarchical in structure until we reach some lowest level of elementary subsystem. 

The hierarchical nature of complex systems is essential to both their design and 

their description. The designer need only deal with a particular level of the system at 

a time. At each level, the system consists of a set of components and their interrela- 

tionships. The behavior at each level depends only on a simplified, abstracted charac- 

terization of the system at the next lower level. At each level, the designer is 
concerned with structure and function: 


‘© Structure: The way in which the components are interrelated 
uae Le ocean 2 EDS OS UO ES LG 

In terms of description, we have two choices: starting at the bottom and build- 
ing up to a complete description, or beginning with a top view and decomposing the 
system into its subparts. Evidence from a number of fields suggests that the top- 
down approach is the clearest and most effective [WEIN75]. 

The approach taken in this book follows from this viewpoint. The computer 
system will be described from the top down. We begin with the major components of 
a computer, describing their structure and function, and proceed to successively 
lower layers of the hierarchy. The remainder of this section provides a very brief 
overview of this plan of attack. 
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Figure 1.1 A Functional View of the Computer 


Function 


Both the structure and functioning of a computer are, in essence, simple. Figure 1.1 
depicts the basic functions that a computer can perform. In general terms, there are 


The computer, of course, must be able to process data. The data may take a wide 
variety of forms, and the range of processing requirements is broad. However, we shall 
see that there are only a few fundamental methods or types of data processing. 

It is also essential that a computer store data. Even if the computer is processing 
data on the fly (i.e., data come in and get processed, and the results go out immedi- 
ately), the computer must temporarily store at least those pieces of data that are being 
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Storage Processing 


(a) (b) 


Storage Processing Storage Processing 


(c) (d) 


Figure 1.2 Possible Computer Operations 


worked on at any given moment. Thus, there is at least a short-term data storage func- 
tion. Equally important, the computer performs a long-term data storage function. 
Files of data are stored on the computer for subsequent retrieval and update. 

The computer must be able to move data between itself and the outside world. 
The computer’s operating environment consists of devices that serve as either 
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sources or destinations of data 


Finally, there must be control of these three functions. Ultimately, this control 
is exercised by the individual(s) who provides the computer with instructions. Within 
the computer, a control unit manages the computer’s resources and orchestrates the 
performance of its functional parts in response to those instructions. 

At this general level of discussion, the number of possible operations that can 
be performed is few. Figure 1.2 depicts the four possible types of operations. The 
computer can function as a data movement device (Figure 1.2a), simply transferring 
data from one peripheral or communications line to another. It can also function as 
a data storage device (Figure 1.2b), with data transferred from the external environ- 
ment to computer storage (read) and vice versa (write). The final two diagrams 
show operations involving data processing, on data either in storage (Figure 1.2c) or 
en route between storage and the external environment (Figure 1.2d). 

The preceding discussion may seem absurdly generalized. It is certainly possi- 
ble, even at a top level of computer structure, to differentiate a variety of functions, 
but, to quote [SIE W82], 


There is remarkably little shaping of computer structure to fit the 
function to be performed. At the root of this lies the general-purpose 
nature of computers, in which all the functional specialization occurs 
at the time of programming and not at the time of design. 


Structure 


Figure 1.3 is the simplest possible depiction of a computer. The computer interacts 
in some fashion with its external environment. In general, all of its linkages to the 
external environment can be classified as peripheral devices or communication 
lines. We will have something to say about both types of linkages. 


COMPUTER 


¢ Storage 
¢ Processing 


Figure 1.3 The Computer 
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Figure 1.4 The Computer: Top-Level Structure 


But of greater concern in this book is the internal structure of the computer 


itself, which is shown in Figure 1.4. There are four main structural components: 
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interconnection is by means of a system bus, consisting of a number of con- 
ducting wires to which all the other components attach. 


There may be one or more of each of the aforementioned components. Tradi- 
tionally, there has been just a single processor. In recent years, there has been in- 
creasing use of multiple processors in a single computer. Some design issues relating 
to multiple processors crop up and are discussed as the text proceeds; Part Five 
focuses on such computers. 

Each of these components will be examined in some detail in Part Two. How- 
ever, for our purposes, the most interesting and in some ways the most complex 


component is the CPU. Its major structural components are as follows: 


Each of these components will be examined in some detail in Part Three, where we 
will see that complexity is added by the use of parallel and pipelined organizational 
techniques. Finally, there are several approaches to the implementation of the con- 
trol unit; one common approach is a microprogrammed implementation. In essence, 
a microprogrammed control unit operates by executing microinstructions that define 
the functionality of the control unit. With this approach, the structure of the control 
unit can be depicted, as in Figure 1.4. This structure will be examined in Part Four. 
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Key Terms 


arithmetic and logic unit computer organization processor 
(ALU) control unit registers 


central processing unit (CPU) | input-output (I/O) system bus 
computer architecture main memory 


Review Questions 


1.1. What, in general terms, is the distinction between computer organization and com- 
puter architecture? 


1.2. What, in general terms, is the distinction between computer structure and computer 
function? 


1.3. What are the four main functions of a computer? 
1.4. List and briefly define the main structural components of a computer. 
1.5. List and briefly define the main structural components of a processor. 
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KEY POINTS 


® The evolution of computers has been characterized by increasing processor 
speed, decreasing component size, increasing memory size, and increasing 
I/O capacity and speed. 


® One factor responsible for the great increase in processor speed is the 
shrinking size of microprocessor components; this reduces the distance be- 
tween components and hence increases speed. However, the true gains in 
speed in recent years have come from the organization of the processor, in- 
cluding heavy use of pipelining and parallel execution techniques and the 
use of speculative execution techniques (tentative execution of future in- 
structions that might be needed). All of these techniques are designed to 
keep the processor busy as much of the time as possible. 


® A critical issue in computer system design is balancing the performance of 
the various elements so that gains in performance in one area are not hand- 
icapped by a lag in other areas. In particular, processor speed has increased 
more rapidly than memory access time. A variety of techniques is used to 
compensate for this mismatch, including caches, wider data paths from 
memory to processor, and more intelligent memory chips. 


We begin our study of computers with a brief history. This history is itself interest- 
ing and also serves the purpose of providing an overview of computer structure 
and function. Next, we address the issue of performance. A consideration of the 
need for balanced utilization of computer resources provides a context that is use- 
ful throughout the book. Finally, we look briefly at the evolution of the two sys- 
tems that serve as key examples throughout the book: the Intel x86 and ARM 
processor families. 


2.1 A BRIEF HISTORY OF COMPUTERS 


The First Generation: Vacuum Tubes 


ENIAC The ENIAC (Electronic Numerical Integrator And Computer), designed 
and constructed at the University of Pennsylvania, was the world’s first general- 
purpose electronic digital computer. The project was a response to U.S. needs during 
World War II. The Army’s Ballistics Research Laboratory (BRL), an agency respon- 
sible for developing range and trajectory tables for new weapons, was having diffi- 
culty supplying these tables accurately and within a reasonable time frame. Without 
these firing tables, the new weapons and artillery were useless to gunners. The BRL 
employed more than 200 people who, using desktop calculators, solved the neces- 
sary ballistics equations. Preparation of the tables for a single weapon would take 
one person many hours, even days. 
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John Mauchly, a professor of electrical engineering at the University of 
Pennsylvania, and John Eckert, one of his graduate students, proposed to build a 
general-purpose computer using vacuum tubes for the BRL’s application. In 1943, 
the Army accepted this proposal, and work began on the ENIAC. The resulting 
machine was enormous, weighing 30 tons, occupying 1500 square feet of floor 
space, and containing more than 18,000 vacuum tubes. When operating, it con- 
sumed 140 kilowatts of power. It was also substantially faster than any electro- 
mechanical computer, capable of 5000 additions per second. 

The ENIAC was a decimal rather than a binary machine. That is, numbers 
were represented in decimal form, and arithmetic was performed in the decimal sys- 
tem. Its memory consisted of 20 “accumulators,” each capable of holding a 10-digit 
decimal number. A ring of 10 vacuum tubes represented each digit. At any time, 
only one vacuum tube was in the ON state, representing one of the 10 digits. The 
major drawback of the ENIAC was that it had to be programmed manually by set- 
ting switches and plugging and unplugging cables. 

The ENIAC was completed in 1946, too late to be used in the war effort. In- 
stead, its first task was to perform a series of complex calculations that were used to 
help determine the feasibility of the hydrogen bomb. The use of the ENIAC for a 
purpose other than that for which it was built demonstrated its general-purpose 
nature. The ENIAC continued to operate under BRL management until 1955, when 
it was disassembled. 


THE VON NEUMANN MACHINE The task of entering and altering programs for the 
ENIAC was extremely tedious. The programming process could be facilitated if the 
program could be represented in a form suitable for storing in memory alongside 
the data. Then, a computer could get its instructions by reading them from memory, 
and a program could be set or altered by setting the values of a portion of memory. 

This idea, known as the stored-program concept, is usually attributed to the 
ENIAC designers, most notably the mathematician John von Neumann, who was a 
consultant on the ENIAC project. Alan Turing developed the idea at about the same 
time. The first publication of the idea was in a 1945 proposal by von Neumann for a 
new computer, the EDVAC (Electronic Discrete Variable Computer). 

In 1946, von Neumann and his colleagues began the design of a new stored- 
program computer, referred to as the IAS computer, at the Princeton Institute for 
Advanced Studies. The IAS computer, although not completed until 1952, is the pro- 
totype of all subsequent general-purpose computers. 

Figure 2.1 shows the general structure of the [AS computer (compare to mid- 
dle portion of Figure 1.4). It consists of 


e A main memory, which stores both data and instructions! 


e An arithmetic and logic unit (ALU) capable of operating on binary data 


'In this book, unless otherwise noted, the term instruction refers to a machine instruction that is 
directly interpreted and executed by the processor, in contrast to an instruction in a high-level lan- 
guage, such as Ada or C++, which must first be compiled into a series of machine instructions before 
being executed. 
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Figure 2.1 Structure of the [AS Computer 


e Acontrol unit, which interprets the instructions in memory and causes them to 
be executed 


e Input and output (I/O) equipment operated by the control unit 


This structure was outlined in von Neumann’s earlier proposal, which is worth 
quoting at this point [VONN45]: 


2.2 First: Because the device is primarily a computer, it will 
have to perform the elementary operations of arithmetic most fre- 
quently. These are addition, subtraction, multiplication and divi- 
sion. It is therefore reasonable that it should contain specialized 
organs for just these operations. 

It must be observed, however, that while this principle as 
such is probably sound, the specific way in which it is realized re- 
quires close scrutiny. At any rate a central arithmetical part of the 
device will probably have to exist and this constitutes the first spe- 
cific part: CA. 

2.3 Second: The logical control of the device, that is, the 
proper sequencing of its operations, can be most efficiently carried 
out by acentral control organ. If the device is to be elastic, that is, as 
nearly as possible all purpose, then a distinction must be made be- 
tween the specific instructions given for and defining a particular 
problem, and the general control organs which see to it that these 
instructions—no matter what they are—are carried out. The for- 
mer must be stored in some way; the latter are represented by def- 
inite operating parts of the device. By the central control we mean 
this latter function only, and the organs which perform it form the 
second specific part: CC. 
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2.4 Third: Any device which is to carry out long and compli- 
cated sequences of operations (specifically of calculations) must 
have a considerable memory... 

(b) The instructions which govern a complicated problem 
may constitute considerable material, particularly so, if the code is 
circumstantial (which it is in most arrangements). This material 
must be remembered. 

At any rate, the total memory constitutes the third specific 
part of the device: M. 

2.6 The three specific parts CA, CC (together C), and M cor- 
respond to the associative neurons in the human nervous system. It 
remains to discuss the equivalents of the sensory or afferent and the 
motor or efferent neurons. These are the input and output organs of 
the device. 

The device must be endowed with the ability to maintain 
input and output (sensory and motor) contact with some specific 
medium of this type. The medium will be called the outside record- 
ing medium of the device: R. 

2.7 Fourth: The device must have organs to transfer . . . infor- 
mation from R into its specific parts C and M. These organs form 
its input, the fourth specific part: I. It will be seen that it is best to 
make all transfers from R (by I) into M and never directly from C. 

2.8 Fifth: The device must have organs to transfer . . . from its 
specific parts C and M into R. These organs form its output, the fifth 
specific part: O. It will be seen that it is again best to make all trans- 
fers from M (by O) into R, and never directly from C. 


With rare exceptions, all of today’s computers have this same general structure 
and function and are thus referred to as von Neumann machines. Thus, it is worth- 
while at this point to describe briefly the operation of the [AS computer [BURK46]. 
Following [HAYE98], the terminology and notation of von Neumann are changed 
in the following to conform more closely to modern usage; the examples and illus- 
trations accompanying this discussion are based on that latter text. 

The memory of the IAS consists of 1000 storage locations, called words, of 
40 binary digits (bits) each.? Both data and instructions are stored there. Numbers 
are represented in binary form, and each instruction is a binary code. Figure 2.2 
illustrates these formats. Each number is represented by a sign bit and a 39-bit value. 
A word may also contain two 20-bit instructions, with each instruction consisting of 
an 8-bit operation code (opcode) specifying the operation to be performed and a 
12-bit address designating one of the words in memory (numbered from 0 to 999). 

The control unit operates the IAS by fetching instructions from memory and 
executing them one at a time. To explain this, a more detailed structure diagram is 


There is no universal definition of the term word. In general, a word is an ordered set of bytes or bits that 
is the normal unit in which information may be stored, transmitted, or operated on within a given com- 
puter. Typically, if a processor has a fixed-length instruction set, then the instruction length equals the 
word length. 
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Figure 2.2. [AS Memory Formats 


needed, as indicated in Figure 2.3. This figure reveals that both the control unit and 
the ALU contain storage locations, called registers, defined as follows: 


Memory buffer register (MBR): Contains a word to be stored in memory or 
sent to the I/O unit, or is used to receive a word from memory or from the 
V/O unit. 

Memory address register (MAR): Specifies the address in memory of the 
word to be written from or read into the MBR. 

Instruction register (IR): Contains the 8-bit opcode instruction being exe- 
cuted. 

Instruction buffer register (IBR): Employed to hold temporarily the right- 
hand instruction from a word in memory. 

Program counter (PC): Contains the address of the next instruction-pair to be 
fetched from memory. 

Accumulator (AC) and multiplier quotient (MQ): Employed to hold tem- 
porarily operands and results of ALU operations. For example, the result of 
multiplying two 40-bit numbers is an 80-bit number; the most significant 
40 bits are stored in the AC and the least significant in the MQ. 


The IAS operates by repetitively performing an instruction cycle, as shown in 


Figure 2.4. Each instruction cycle consists of two subcycles. During the fetch cycle, 
the opcode of the next instruction is loaded into the IR and the address portion is 
loaded into the MAR. This instruction may be taken from the IBR, or it can be ob- 
tained from memory by loading a word into the MBR, and then down to the IBR, 
IR, and MAR. 


Why the indirection? These operations are controlled by electronic circuitry 


and result in the use of data paths. To simplify the electronics, there is only one 
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Figure 2.3 Expanded Structure of IAS Computer 


register that is used to specify the address in memory for a read or write and only 
one register used for the source or destination. 

Once the opcode is in the IR, the execute cycle is performed. Control circuitry in- 
terprets the opcode and executes the instruction by sending out the appropriate con- 
trol signals to cause data to be moved or an operation to be performed by the ALU. 

The IAS computer had a total of 21 instructions, which are listed in Table 2.1. 
These can be grouped as follows: 


e Data transfer: Move data between memory and ALU registers or between two 
ALU registers. 
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Figure 2.4 Partial Flowchart of IAS Operation 


e Unconditional branch: Normally, the control unit executes instructions in se- 
quence from memory. This sequence can be changed by a branch instruction, 
which facilitates repetitive operations. 


e Conditional branch: The branch can be made dependent on a condition, thus 
allowing decision points. 


e Arithmetic: Operations performed by the ALU. 


e Address modify: Permits addresses to be computed in the ALU and then in- 
serted into instructions stored in memory. This allows a program considerable 
addressing flexibility. 
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Table 2.1 The IAS Instruction Set 


Instruction Symbolic 
Type Opcode __ Representation Description 


00001010 LOAD MQ Transfer contents of register MQ to the accumulator AC 
00001001 LOAD MQ,M(X) Transfer contents of memory location X to MO 
00100001 STOR M(x) Transfer contents of accumulator to memory location X 
Data transfer 00000001 LOAD M(X) Transfer M(X) to the accumulator 
00000010 LOAD —M(X) Transfer —M(X) to the accumulator 
00000011 LOAD |M(X)| Transfer absolute value of M(X) to the accumulator 
00000100 LOAD —|M(xX)| Transfer —|M(X)| to the accumulator 
Unconditional 90001101 JUMP M(X,0:19) Take next instruction from left half of M(X) 
branch 00001110 JUMP M(X,20:39) Take next instruction from right half of M(X) 


00001111 JUMP+ M(X,0:19) If number in the accumulator is nonnegative, take next in- 
Conditional struction from left half of M(X) 


branch 00010000 JUMP+ M(X,20:39) If number in the accumulator is nonnegative, take next 


instruction from right half of M(X) 
00000101 ADD M(X) Add M(X) to AC; put the result in AC 
00000111 ADD |M(X)| Add |M(X)| to AC; put the result in AC 
00000110 SUB M(X) Subtract M(X) from AC; put the result in AC 
00001000 SUB |M(X)| Subtract |M(X)| from AC; put the remainder in AC 


Arithmetic 00001011 MUL M(X) Multiply MCX) by MQ; put most significant bits of result 
in AC, put least significant bits in MQ 


00001100 DIV M(x) Divide AC by M(X); put the quotient in MQ and the 
remainder in AC 


00010100 LSH Multiply accumulator by 2;.e., shift left one bit position 
00010101 RSH Divide accumulator by 2; i.e., shift right one position 


00010010 STOR M(X,8:19) Replace left address field at M(X) by 12 rightmost bits 
Address of AC 


modify 00010011 STOR M(X,28:39) Replace right address field at M(X) by 12 rightmost 
bits of AC 


Table 2.1 presents instructions in a symbolic, easy-to-read form. Actually, each 
instruction must conform to the format of Figure 2.2b. The opcode portion (first 
8 bits) specifies which of the 21 instructions is to be executed. The address portion 
(remaining 12 bits) specifies which of the 1000 memory locations is to be involved in 
the execution of the instruction. 

Figure 2.4 shows several examples of instruction execution by the control unit. 
Note that each operation requires several steps. Some of these are quite elaborate. 
The multiplication operation requires 39 suboperations, one for each bit position ex- 
cept that of the sign bit. 


COMMERCIAL COMPUTERS The 1950s saw the birth of the computer industry with 
two companies, Sperry and IBM, dominating the marketplace. 
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In 1947, Eckert and Mauchly formed the Eckert-Mauchly Computer Corpora- 
tion to manufacture computers commercially. Their first successful machine was the 
UNIVAC I (Universal Automatic Computer), which was commissioned by the 
Bureau of the Census for the 1950 calculations. The Eckert-Mauchly Computer Cor- 
poration became part of the UNIVAC division of Sperry-Rand Corporation, which 
went on to build a series of successor machines. 

The UNIVAC I was the first successful commercial computer. It was intended 
for both scientific and commercial applications. The first paper describing the sys- 
tem listed matrix algebraic computations, statistical problems, premium billings 
for a life insurance company, and logistical problems as a sample of the tasks it could 
perform. 

The UNIVAC II, which had greater memory capacity and higher performance 
than the UNIVAC I, was delivered in the late 1950s and illustrates several trends that 
have remained characteristic of the computer industry. First, advances in technology 
allow companies to continue to build larger, more powerful computers. Second, each 
company tries to make its new machines backward compatible’ with the older ma- 
chines. This means that the programs written for the older machines can be executed 
on the new machine. This strategy is adopted in the hopes of retaining the customer 
base; that is, when a customer decides to buy a newer machine, he or she is likely to 
get it from the same company to avoid losing the investment in programs. 

The UNIVAC division also began development of the 1100 series of comput- 
ers, which was to be its major source of revenue. This series illustrates a distinction 
that existed at one time. The first model, the UNIVAC 1103, and its successors for 
many years were primarily intended for scientific applications, involving long and 
complex calculations. Other companies concentrated on business applications, which 
involved processing large amounts of text data. This split has largely disappeared, 
but it was evident for a number of years. 

IBM, then the major manufacturer of punched-card processing equipment, de- 
livered its first electronic stored-program computer, the 701, in 1953. The 701 was in- 
tended primarily for scientific applications [BASH81]. In 1955, IBM introduced the 
companion 702 product, which had a number of hardware features that suited it to 
business applications. These were the first of a long series of 700/7000 computers 
that established IBM as the overwhelmingly dominant computer manufacturer. 


The Second Generation: Transistors 


The first major change in the electronic computer came with the replacement of the 
vacuum tube by the transistor. The transistor is smaller, cheaper, and dissipates less 
heat than a vacuum tube but can be used in the same way as a vacuum tube to con- 
struct computers. Unlike the vacuum tube, which requires wires, metal plates, a glass 
capsule, and a vacuum, the transistor is a solid-state device, made from silicon. 

The transistor was invented at Bell Labs in 1947 and by the 1950s had launched 
an electronic revolution. It was not until the late 1950s, however, that fully transis- 
torized computers were commercially available. IBM again was not the first 


3Also called downward compatible. The same concept, from the point of view of the older system, is 
referred to as upward compatible, or forward compatible. 
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Table 2.2 Computer Generations 


Approximate Typical Speed 
Generation Dates Technology (operations per second) 


1946-1957 Vacuum tube 40,000 
1958-1964 Transistor 200,000 


1965-1971 Small and medium scale 1,000,000 
integration 


1972-1977 Large scale integration 10,000,000 
1978-1991 Very large scale integration 100,000,000 
1991— Ultra large scale integration 1,000,000,000 


company to deliver the new technology. NCR and, more successfully, RCA were the 
front-runners with some small transistor machines. IBM followed shortly with the 
7000 series. 

The use of the transistor defines the second generation of computers. It has be- 
come widely accepted to classify computers into generations based on the fundamen- 
tal hardware technology employed (Table 2.2). Each new generation is characterized 
by greater processing performance, larger memory capacity, and smaller size than the 
previous one. 

But there are other changes as well. The second generation saw the introduc- 
tion of more complex arithmetic and logic units and control units, the use of high- 
level programming languages, and the provision of system software with the 
computer. 

The second generation is noteworthy also for the appearance of the Digital 
Equipment Corporation (DEC). DEC was founded in 1957 and, in that year, deliv- 
ered its first computer, the PDP-1. This computer and this company began the mini- 
computer phenomenon that would become so prominent in the third generation. 


THE IBM 7094 From the introduction of the 700 series in 1952 to the introduction 
of the last member of the 7000 series in 1964, this IBM product line underwent an 
evolution that is typical of computer products. Successive members of the product 
line show increased performance, increased capacity, and/or lower cost. 

Table 2.3 illustrates this trend. The size of main memory, in multiples of 2" 36-bit 
words, grew from 2K (1K = 2") to 32K words,’ while the time to access one word of 
memory, the memory cycle time, fell from 30 ps to 1.4 ws. The number of opcodes 
grew from a modest 24 to 185. 

The final column indicates the relative execution speed of the central process- 
ing unit (CPU). Speed improvements are achieved by improved electronics (e.g., a 
transistor implementation is faster than a vacuum tube implementation) and more 
complex circuitry. For example, the IBM 7094 includes an Instruction Backup Reg- 
ister, used to buffer the next instruction. The control unit fetches two adjacent words 


4A discussion of the uses of numerical prefixes, such as kilo and giga, is contained in a supporting docu- 
ment at the Computer Science Student Resource Site at WilliamStallings.com/StudentSupport.html. 
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Figure 2.5 An IBM 7094 Configuration 


from memory for an instruction fetch. Except for the occurrence of a branching in- 
struction, which is typically infrequent, this means that the control unit has to access 
memory for an instruction on only half the instruction cycles. This prefetching sig- 
nificantly reduces the average instruction cycle time. 

The remainder of the columns of Table 2.3 will become clear as the text proceeds. 

Figure 2.5 shows a large (many peripherals) configuration for an IBM 7094, 
which is representative of second-generation computers [BELL71]. Several differ- 
ences from the IAS computer are worth noting. The most important of these is the 
use of data channels. A data channel is an independent I/O module with its own 
processor and its own instruction set. In a computer system with such devices, the 
CPU does not execute detailed I/O instructions. Such instructions are stored in a 
main memory to be executed by a special-purpose processor in the data channel it- 
self. The CPU initiates an I/O transfer by sending a control signal to the data channel, 
instructing it to execute a sequence of instructions in memory. The data channel per- 
forms its task independently of the CPU and signals the CPU when the operation is 
complete. This arrangement relieves the CPU of a considerable processing burden. 

Another new feature is the multiplexor, which is the central termination point for 
data channels, the CPU, and memory. The multiplexor schedules access to the memory 
from the CPU and data channels, allowing these devices to act independently. 


The Third Generation: Integrated Circuits 


A single, self-contained transistor is called a discrete component. Throughout the 
1950s and early 1960s, electronic equipment was composed largely of discrete 
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components — transistors, resistors, capacitors, and so on. Discrete components were 
manufactured separately, packaged in their own containers, and soldered or wired 
together onto masonite-like circuit boards, which were then installed in computers, 
oscilloscopes, and other electronic equipment. Whenever an electronic device called 
for a transistor, a little tube of metal containing a pinhead-sized piece of silicon had 
to be soldered to a circuit board. The entire manufacturing process, from transistor 
to circuit board, was expensive and cumbersome. 

These facts of life were beginning to create problems in the computer industry. 
Early second-generation computers contained about 10,000 transistors. This figure 
grew to the hundreds of thousands, making the manufacture of newer, more power- 
ful machines increasingly difficult. 

In 1958 came the achievement that revolutionized electronics and started the 
era of microelectronics: the invention of the integrated circuit. It is the integrated 
circuit that defines the third generation of computers. In this section we provide a 
brief introduction to the technology of integrated circuits. Then we look at perhaps 
the two most important members of the third generation, both of which were intro- 
duced at the beginning of that era: the IBM System/360 and the DEC PDP-8. 


MICROELECTRONICS Microelectronics means, literally, “small electronics.” Since 
the beginnings of digital electronics and the computer industry, there has been a 
persistent and consistent trend toward the reduction in size of digital electronic cir- 
cuits. Before examining the implications and benefits of this trend, we need to say 
something about the nature of digital electronics. A more detailed discussion is 
found in Chapter 20. 

The basic elements of a digital computer, as we know, must perform storage, 
movement, processing, and control functions. Only two fundamental types of com- 
ponents are required (Figure 2.6): gates and memory cells. A gate is a device that im- 
plements a simple Boolean or logical function, such as IF A AND B ARE TRUE 
THEN CIS TRUE (AND gate). Such devices are called gates because they control 
data flow in much the same way that canal gates do. The memory cell is a device that 
can store one bit of data; that is, the device can be in one of two stable states at any 
time. By interconnecting large numbers of these fundamental devices, we can con- 
struct a computer. We can relate this to our four basic functions as follows: 


e Data storage: Provided by memory cells. 
e Data processing: Provided by gates. 
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Figure 2.6 Fundamental Computer Elements 
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e Data movement: The paths among components are used to move data from 
memory to memory and from memory through gates to memory. 


e Control: The paths among components can carry control signals. For example, 
a gate will have one or two data inputs plus a control signal input that activates 
the gate. When the control signal is ON, the gate performs its function on the 
data inputs and produces a data output. Similarly, the memory cell will store 
the bit that is on its input lead when the WRITE control signal is ON and will 
place the bit that is in the cell on its output lead when the READ control sig- 
nal is ON. 


Thus, a computer consists of gates, memory cells, and interconnections among 
these elements. The gates and memory cells are, in turn, constructed of simple digi- 
tal electronic components. 

The integrated circuit exploits the fact that such components as transistors, re- 
sistors, and conductors can be fabricated from a semiconductor such as silicon. It is 
merely an extension of the solid-state art to fabricate an entire circuit in a tiny piece 
of silicon rather than assemble discrete components made from separate pieces of 
silicon into the same circuit. Many transistors can be produced at the same time on 
a single wafer of silicon. Equally important, these transistors can be connected with 
a process of metallization to form circuits. 

Figure 2.7 depicts the key concepts in an integrated circuit. A thin wafer of 
silicon is divided into a matrix of small areas, each a few millimeters square. The 
identical circuit pattern is fabricated in each area, and the wafer is broken up into 
chips. Each chip consists of many gates and/or memory cells plus a number of input 
and output attachment points. This chip is then packaged in housing that protects it 
and provides pins for attachment to devices beyond the chip. A number of these 
packages can then be interconnected on a printed circuit board to produce larger 
and more complex circuits. 

Initially, only a few gates or memory cells could be reliably manufactured and 
packaged together. These early integrated circuits are referred to as small-scale in- 
tegration (SSI). As time went on, it became possible to pack more and more com- 
ponents on the same chip. This growth in density is illustrated in Figure 2.8; it is one 
of the most remarkable technological trends ever recorded.” This figure reflects the 
famous Moore’s law, which was propounded by Gordon Moore, cofounder of Intel, 
in 1965 [MOOR65]. Moore observed that the number of transistors that could be 
put on a single chip was doubling every year and correctly predicted that this pace 
would continue into the near future. To the surprise of many, including Moore, 
the pace continued year after year and decade after decade. The pace slowed to a 
doubling every 18 months in the 1970s but has sustained that rate ever since. 

The consequences of Moore’s law are profound: 


1. The cost of a chip has remained virtually unchanged during this period of 
rapid growth in density. This means that the cost of computer logic and mem- 
ory circuitry has fallen at a dramatic rate. 


Note that the vertical axis uses a log scale. A basic review of log scales is in the math refresher document 
at the Computer Science Student Support Site at WilliamStallings.com/StudentSupport.html. 
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Because logic and memory elements are placed closer together on more densely 
packed chips, the electrical path length is shortened, increasing operating speed. 


io) 


The computer becomes smaller, making it more convenient to place in a variety 
of environments. 


4. There is a reduction in power and cooling requirements. 


5. The interconnections on the integrated circuit are much more reliable than 
solder connections. With more circuitry on each chip, there are fewer interchip 
connections. 


IBM sysTEM/360 By 1964, IBM had a firm grip on the computer market with its 
7000 series of machines. In that year, IBM announced the System/360, a new family 
of computer products. Although the announcement itself was no surprise, it con- 
tained some unpleasant news for current IBM customers: the 360 product line was 
incompatible with older IBM machines. Thus, the transition to the 360 would be dif- 
ficult for the current customer base. This was a bold step by IBM, but one IBM felt 
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Figure 2.8 Growth in CPU Transistor Count [BOHR03] 


was necessary to break out of some of the constraints of the 7000 architecture and to 
produce a system capable of evolving with the new integrated circuit technology 
[PADE81, GIFF87]. The strategy paid off both financially and technically. The 360 
was the success of the decade and cemented IBM as the overwhelmingly dominant 
computer vendor, with a market share above 70%. And, with some modifications and 
extensions, the architecture of the 360 remains to this day the architecture of IBM’s 
mainframe® computers. Examples using this architecture can be found throughout 
this text. 

The System/360 was the industry’s first planned family of computers. The fam- 
ily covered a wide range of performance and cost. Table 2.4 indicates some of the 
key characteristics of the various models in 1965 (each member of the family is dis- 
tinguished by a model number). The models were compatible in the sense that a 
program written for one model should be capable of being executed by another 
model in the series, with only a difference in the time it takes to execute. 

The concept of a family of compatible computers was both novel and ex- 
tremely successful. A customer with modest requirements and a budget to match 
could start with the relatively inexpensive Model 30. Later, if the customer’s needs 
grew, it was possible to upgrade to a faster machine with more memory without 


The term mainframe is used for the larger, most powerful computers other than supercomputers. Typical 
characteristics of a mainframe are that it supports a large database, has elaborate I/O hardware, and is 
used in a central data processing facility. 
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Table 2.4 Key Characteristics of the System/360 Family 


Characteristic 


Maximum memory size (bytes) 


Data rate from memory (Mbytes/sec) 


Processor cycle time ps) 


Relative speed 
Maximum number of data channels 


Maximum data rate on one channel 
(Kbytes/s) 


sacrificing the investment in already-developed software. The characteristics of a 
family are as follows: 


e Similar or identical instruction set: In many cases, the exact same set of ma- 
chine instructions is supported on all members of the family. Thus, a program 
that executes on one machine will also execute on any other. In some cases, the 
lower end of the family has an instruction set that is a subset of that of the top 
end of the family. This means that programs can move up but not down. 


e Similar or identical operating system: The same basic operating system is 
available for all family members. In some cases, additional features are added 
to the higher-end members. 


e Increasing speed: The rate of instruction execution increases in going from 
lower to higher family members. 


e Increasing number of I/O ports: The number of I/O ports increases in going 
from lower to higher family members. 


e Increasing memory size: The size of main memory increases in going from 
lower to higher family members. 


e Increasing cost: At a given point in time, the cost of a system increases in going 
from lower to higher family members. 


How could such a family concept be implemented? Differences were achieved 
based on three factors: basic speed, size, and degree of simultaneity [STEV64]. For 
example, greater speed in the execution of a given instruction could be gained by 
the use of more complex circuitry in the ALU, allowing suboperations to be carried 
out in parallel. Another way of increasing speed was to increase the width of the 
data path between main memory and the CPU. On the Model 30, only 1 byte (8 bits) 
could be fetched from main memory at a time, whereas 8 bytes could be fetched at a 
time on the Model 75. 

The System/360 not only dictated the future course of IBM but also had a pro- 
found impact on the entire industry. Many of its features have become standard on 
other large computers. 


DEC PDP-S In the same year that IBM shipped its first System/360, another 
momentous first shipment occurred: PDP-8 from Digital Equipment Corporation 
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(DEC). At a time when the average computer required an air-conditioned room, the 
PDP-8 (dubbed a minicomputer by the industry, after the miniskirt of the day) was 
small enough that it could be placed on top of a lab bench or be built into other 
equipment. It could not do everything the mainframe could, but at $16,000, it was 
cheap enough for each lab technician to have one. In contrast, the System/360 series 
of mainframe computers introduced just a few months before cost hundreds of 
thousands of dollars. 

The low cost and small size of the PDP-8 enabled another manufacturer to 
purchase a PDP-8 and integrate it into a total system for resale. These other manu- 
facturers came to be known as original equipment manufacturers (OEMs), and the 
OEM market became and remains a major segment of the computer marketplace. 

The PDP-8 was an immediate hit and made DEC’s fortune. This machine and 
other members of the PDP-8 family that followed it (see Table 2.5) achieved a pro- 
duction status formerly reserved for IBM computers, with about 50,000 machines 
sold over the next dozen years. As DEC’s official history puts it, the PDP-8 “estab- 
lished the concept of minicomputers, leading the way to a multibillion dollar indus- 
try.” It also established DEC as the number one minicomputer vendor, and, by the 
time the PDP-8 had reached the end of its useful life, DEC was the number two 
computer manufacturer, behind IBM. 

In contrast to the central-switched architecture (Figure 2.5) used by IBM on 
its 700/7000 and 360 systems, later models of the PDP-8 used a structure that is now 
virtually universal for microcomputers: the bus structure. This is illustrated in 
Figure 2.9. The PDP-8 bus, called the Omnibus, consists of 96 separate signal paths, 
used to carry control, address, and data signals. Because all system components 
share a common set of signal paths, their use must be controlled by the CPU. This ar- 
chitecture is highly flexible, allowing modules to be plugged into the bus to create 
various configurations. 


Later Generations 


Beyond the third generation there is less general agreement on defining generations 
of computers. Table 2.2 suggests that there have been a number of later generations, 
based on advances in integrated circuit technology. With the introduction of large- 
scale integration (LSI), more than 1000 components can be placed on a single inte- 
grated circuit chip. Very-large-scale integration (VLSI) achieved more than 10,000 
components per chip, while current ultra-large-scale integration (ULSI) chips can 
contain more than one million components. 

With the rapid pace of technology, the high rate of introduction of new prod- 
ucts, and the importance of software and communications as well as hardware, the 
classification by generation becomes less clear and less meaningful. It could be said 
that the commercial application of new developments resulted in a major change in 
the early 1970s and that the results of these changes are still being worked out. In 
this section, we mention two of the most important of these results. 


SEMICONDUCTOR MEMORY The first application of integrated circuit technology 
to computers was construction of the processor (the control unit and the arithmetic 
and logic unit) out of integrated circuit chips. But it was also found that this same 
technology could be used to construct memories. 


ce 


Table 2.5 


Model 


Evolution of the PDP-8 [VOEL88] 


Cost of Processor + 4K 
12-bit Words of 
Memory ($1000s) 


Data Rate 
from Memory 
(words/zsec) 


‘Volume 
(cubic feet) 


Innovations and Improvements 


PDP-8 


16.2 


1.26 


8.0 


Automatic wire-wrapping production 


PDP-8/5 


8.79 


0.08 


3.2 


Serial instruction implementation 


PDP-8/1 


11.6 


1.34 


8.0 


Medium scale integrated circuits 


PDP-8/L 


7.0 


1.26 


2.0 


Smaller cabinet 


PDP-8/E 


4.99 


ile 


De) 


Omnibus 


PDP-8/M 


3.69 


ise 


1.8 


Half-size cabinet with fewer slots than 8/E 


PDP-8/A 


2.6 


1.34 


2 


Semiconductor memory; floating-point processor 
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Figure 2.9 PDP-8 Bus Structure 


In the 1950s and 1960s, most computer memory was constructed from tiny 
rings of ferromagnetic material, each about a sixteenth of an inch in diameter. These 
rings were strung up on grids of fine wires suspended on small screens inside the 
computer. Magnetized one way, a ring (called a core) represented a one; magnetized 
the other way, it stood for a zero. Magnetic-core memory was rather fast; it took as 
little as a millionth of a second to read a bit stored in memory. But it was expensive, 
bulky, and used destructive readout: The simple act of reading a core erased the data 
stored in it. It was therefore necessary to install circuits to restore the data as soon as 
it had been extracted. 

Then, in 1970, Fairchild produced the first relatively capacious semiconductor 
memory. This chip, about the size of a single core, could hold 256 bits of memory. It 
was nondestructive and much faster than core. It took only 70 billionths of a second 
to read a bit. However, the cost per bit was higher than for that of core. 

In 1974, a seminal event occurred: The price per bit of semiconductor memory 
dropped below the price per bit of core memory. Following this, there has been a con- 
tinuing and rapid decline in memory cost accompanied by a corresponding increase in 
physical memory density. This has led the way to smaller, faster machines with mem- 
ory sizes of larger and more expensive machines from just a few years earlier. Devel- 
opments in memory technology, together with developments in processor technology 
to be discussed next, changed the nature of computers in less than a decade. Although 
bulky, expensive computers remain a part of the landscape, the computer has also 
been brought out to the “end user,” with office machines and personal computers. 

Since 1970, semiconductor memory has been through 13 generations: 1K, 4K, 
16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G, and, as of this writing, 16 Gbits 
on a single chip (1K = 2",1M = 2”,1G = 2). Each generation has provided four 
times the storage density of the previous generation, accompanied by declining cost 
per bit and declining access time. 


MICROPROCESSORS Just as the density of elements on memory chips has continued 
to rise, so has the density of elements on processor chips. As time went on, more and 
more elements were placed on each chip, so that fewer and fewer chips were needed 
to construct a single computer processor. 

A breakthrough was achieved in 1971, when Intel developed its 4004. The 4004 
was the first chip to contain all of the components of a CPU on a single chip: The mi- 
croprocessor was born. 

The 4004 can add two 4-bit numbers and can multiply only by repeated addi- 
tion. By today’s standards, the 4004 is hopelessly primitive, but it marked the begin- 
ning of a continuing evolution of microprocessor capability and power. 
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This evolution can be seen most easily in the number of bits that the processor 
deals with at a time. There is no clear-cut measure of this, but perhaps the best mea- 
sure is the data bus width: the number of bits of data that can be brought into or sent 
out of the processor at a time. Another measure is the number of bits in the accu- 
mulator or in the set of general-purpose registers. Often, these measures coincide, 
but not always. For example, a number of microprocessors were developed that op- 
erate on 16-bit numbers in registers but can only read and write 8 bits at a time. 

The next major step in the evolution of the microprocessor was the introduc- 
tion in 1972 of the Intel 8008. This was the first 8-bit microprocessor and was almost 
twice as complex as the 4004. 

Neither of these steps was to have the impact of the next major event: the in- 
troduction in 1974 of the Intel 8080. This was the first general-purpose microproces- 
sor. Whereas the 4004 and the 8008 had been designed for specific applications, the 
8080 was designed to be the CPU of a general-purpose microcomputer. Like the 
8008, the 8080 is an 8-bit microprocessor. The 8080, however, is faster, has a richer 
instruction set, and has a large addressing capability. 

About the same time, 16-bit microprocessors began to be developed. How- 
ever, it was not until the end of the 1970s that powerful, general-purpose 16-bit mi- 
croprocessors appeared. One of these was the 8086. The next step in this trend 
occurred in 1981, when both Bell Labs and Hewlett-Packard developed 32-bit, sin- 
gle-chip microprocessors. Intel introduced its own 32-bit microprocessor, the 80386, 
in 1985 (Table 2.6). 


Table 2.6 Evolution of Intel Microprocessors 
(a) 1970s Processors 


Introduced 


Clock speeds 108 kHz 108 kHz 5 MHz, 8 MHz, 10 MHz 5 MHz, 8 MHz 
Bus width 4 bits 8 bits 16 bits 8 bits 
Number of transistors 2,300 3,500 29,000 29,000 


Feature size (um) 10 3 6 
Addressable memory 640 Bytes 16 KB 64 KB 


(b) 1980s Processors 


80286 386TM DX 386TM SX 486TM DX CPU 


Introduced 1982 1985 1988 1989 
Clock speeds 6 MHz-12.5 MHz 16 MHz-33 MHz 16 MHz-33 MHz 25 MHz-50 MHz 
Bus width 16 bits 32 bits 16 bits 32 bits 


Number of transistors 134,000 275,000 275,000 1.2 million 
Feature size (um) 1.5 1 1 0.8-1 
Addressable memory 16 MB 4 GB 16 MB 4 GB 
Virtual memory 1 GB 64 TB 64 TB 64 TB 
Cache _ — - 8kB 
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Table 2.6 Continued 
(c) 1990s Processors 


486T™ SX Pentium Pentium Pro Pentium II 


Introduced 1991 1993 1995 1997 
Clock speeds 16 MHz-33 MHz | 60 MHz-166 MHz, | 150 MHz-200 MHz | 200 MHz-300 MHz 
Bus width 32 bits 32 bits 64 bits 64 bits 


Number of transistors 1.185 million 3.1 million 5.5 million 7.5 million 


Feature size (um) 0.8 0.6 0.35 
Addressable memory 64 GB 64 GB 
Virtual memory 64 TB 64 TB 


Cache 512 kB L1 and 512 kB L2 
1 MBL2 


(d) Recent Processors 


Pentium II Pentium 4 Core 2 Duo Core 2 Quad 


Introduced 1999 2000 2006 2008 
Clock speeds 450-660 MHz 1.3-1.8 GHz 1.06-1.2 GHz 3 GHz 
Bus sidth 64 bits 64 bits 64 bits 64 bits 


Number of transistors 9.5 million 42 million 167 million 820 million 
Feature size (nm) 250 180 65 45 
Addressable memory 64 GB 64 GB 64 GB 64 GB 
Virtual memory 64 TB 64 TB 64TB 64 TB 
512 kB L2 256 kB L2 


2.2 DESIGNING FOR PERFORMANCE 


Year by year, the cost of computer systems continues to drop dramatically, while the 
performance and capacity of those systems continue to rise equally dramatically. At 
a local warehouse club, you can pick up a personal computer for less than $1000 that 
packs the wallop of an IBM mainframe from 10 years ago. Thus, we have virtually 
“free” computer power. And this continuing technological revolution has enabled 
the development of applications of astounding complexity and power. For example, 
desktop applications that require the great power of today’s microprocessor-based 
systems include 


e Image processing 
e Speech recognition 
e Videoconferencing 


Multimedia authoring 
e Voice and video annotation of files 


e Simulation modeling 
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Workstation systems now support highly sophisticated engineering and scien- 
tific applications, as well as simulation systems, and have the ability to support 
image and video applications. In addition, businesses are relying on increasingly 
powerful servers to handle transaction and database processing and to support 
massive client/server networks that have replaced the huge mainframe computer 
centers of yesteryear. 

What is fascinating about all this from the perspective of computer organiza- 
tion and architecture is that, on the one hand, the basic building blocks for today’s 
computer miracles are virtually the same as those of the [AS computer from over 
50 years ago, while on the other hand, the techniques for squeezing the last iota of 
performance out of the materials at hand have become increasingly sophisticated. 

This observation serves as a guiding principle for the presentation in this book. 
As we progress through the various elements and components of a computer, two 
objectives are pursued. First, the book explains the fundamental functionality in 
each area under consideration, and second, the book explores those techniques re- 
quired to achieve maximum performance. In the remainder of this section, we high- 
light some of the driving factors behind the need to design for performance. 


Microprocessor Speed 


What gives Intel x86 processors or IBM mainframe computers such mind-boggling 
power is the relentless pursuit of speed by processor chip manufacturers. The evolu- 
tion of these machines continues to bear out Moore’s law, mentioned previously. So 
long as this law holds, chipmakers can unleash a new generation of chips every three 
years—with four times as many transistors. In memory chips, this has quadrupled 
the capacity of dynamic random-access memory (DRAM), still the basic technology 
for computer main memory, every three years. In microprocessors, the addition of 
new circuits, and the speed boost that comes from reducing the distances between 
them, has improved performance four- or fivefold every three years or so since Intel 
launched its x86 family in 1978. 

But the raw speed of the microprocessor will not achieve its potential unless it 
is fed a constant stream of work to do in the form of computer instructions. Any- 
thing that gets in the way of that smooth flow undermines the power of the proces- 
sor. Accordingly, while the chipmakers have been busy learning how to fabricate 
chips of greater and greater density, the processor designers must come up with ever 
more elaborate techniques for feeding the monster. Among the techniques built 
into contemporary processors are the following: 


e Branch prediction: The processor looks ahead in the instruction code fetched 
from memory and predicts which branches, or groups of instructions, are likely 
to be processed next. If the processor guesses right most of the time, it can 
prefetch the correct instructions and buffer them so that the processor is kept 
busy. The more sophisticated examples of this strategy predict not just the next 
branch but multiple branches ahead. Thus, branch prediction increases the 
amount of work available for the processor to execute. 


e Data flow analysis: The processor analyzes which instructions are dependent 
on each other’s results, or data, to create an optimized schedule of instructions. 
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In fact, instructions are scheduled to be executed when ready, independent of 
the original program order. This prevents unnecessary delay. 


e Speculative execution: Using branch prediction and data flow analysis, some 
processors speculatively execute instructions ahead of their actual appearance 
in the program execution, holding the results in temporary locations. This en- 
ables the processor to keep its execution engines as busy as possible by exe- 
cuting instructions that are likely to be needed. 


These and other sophisticated techniques are made necessary by the sheer power 
of the processor. They make it possible to exploit the raw speed of the processor. 


Performance Balance 


While processor power has raced ahead at breakneck speed, other critical compo- 
nents of the computer have not kept up. The result is a need to look for performance 
balance: an adjusting of the organization and architecture to compensate for the 
mismatch among the capabilities of the various components. 

Nowhere is the problem created by such mismatches more critical than in the 
interface between processor and main memory. Consider the history depicted in 
Figure 2.10. While processor speed has grown rapidly, the speed with which data can 
be transferred between main memory and the processor has lagged badly. The inter- 
face between processor and main memory is the most crucial pathway in the entire 
computer because it is responsible for carrying a constant flow of program instruc- 
tions and data between memory chips and the processor. If memory or the pathway 
fails to keep pace with the processor’s insistent demands, the processor stalls in a 
wait state, and valuable processing time is lost. 
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Figure 2.10 Logic and Memory Performance Gap [BORKO3] 
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There are a number of ways that a system architect can attack this problem, all 
of which are reflected in contemporary computer designs. Consider the following 
examples: 


e Increase the number of bits that are retrieved at one time by making DRAMs 


“wider” rather than “deeper” and by using wide bus data paths. 


e Change the DRAM interface to make it more efficient by including a cache’ 


or other buffering scheme on the DRAM chip. 


e Reduce the frequency of memory access by incorporating increasingly com- 
plex and efficient cache structures between the processor and main memory. 
This includes the incorporation of one or more caches on the processor chip as 
well as on an off-chip cache close to the processor chip. 


e Increase the interconnect bandwidth between processors and memory by 
using higher-speed buses and by using a hierarchy of buses to buffer and struc- 
ture data flow. 


Another area of design focus is the handling of I/O devices. As computers be- 
come faster and more capable, more sophisticated applications are developed that 
support the use of peripherals with intensive I/O demands. Figure 2.11 gives some 


Gigabit Ethernet 


Graphics display 


Hard disk 


Ethernet 


Optical disk 


Scanner 


Laser printer 


Floppy disk 


Modem 


Mouse 


Keyboard 


10! 10? 10° 104 10° 10° 10’ 10° 10° 
Data rate (bps) 
Figure 2.11 Typical I/O Device Data Rates 


TA cache is a relatively small fast memory interposed between a larger, slower memory and the logic that 
accesses the larger memory. The cache holds recently accessed data, and is designed to speed up subse- 
quent access to the same data. Caches are discussed in Chapter 4. 
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examples of typical peripheral devices in use on personal computers and worksta- 
tions. These devices create tremendous data throughput demands. While the current 
generation of processors can handle the data pumped out by these devices, there re- 
mains the problem of getting that data moved between processor and peripheral. 
Strategies here include caching and buffering schemes plus the use of higher-speed 
interconnection buses and more elaborate structures of buses. In addition, the use of 
multiple-processor configurations can aid in satisfying I/O demands. 

The key in all this is balance. Designers constantly strive to balance the 
throughput and processing demands of the processor components, main memory, 
I/O devices, and the interconnection structures. This design must constantly be 
rethought to cope with two constantly evolving factors: 


e The rate at which performance is changing in the various technology areas 
(processor, buses, memory, peripherals) differs greatly from one type of ele- 
ment to another. 


e New applications and new peripheral devices constantly change the nature of 
the demand on the system in terms of typical instruction profile and the data 
access patterns. 


Thus, computer design is a constantly evolving art form. This book attempts to 
present the fundamentals on which this art form is based and to present a survey of 
the current state of that art. 


Improvements in Chip Organization and Architecture 


As designers wrestle with the challenge of balancing processor performance with that 
of main memory and other computer components, the need to increase processor 
speed remains. There are three approaches to achieving increased processor speed: 


e Increase the hardware speed of the processor. This increase is fundamentally 
due to shrinking the size of the logic gates on the processor chip, so that more 
gates can be packed together more tightly and to increasing the clock rate. 
With gates closer together, the propagation time for signals is significantly re- 
duced, enabling a speeding up of the processor. An increase in clock rate 
means that individual operations are executed more rapidly. 


e Increase the size and speed of caches that are interposed between the proces- 
sor and main memory. In particular, by dedicating a portion of the processor 
chip itself to the cache, cache access times drop significantly. 


e Make changes to the processor organization and architecture that increase the 
effective speed of instruction execution. Typically, this involves using paral- 
lelism in one form or another. 


Traditionally, the dominant factor in performance gains has been in increases 
in clock speed due and logic density. Figure 2.12 illustrates this trend for Intel 
processor chips. However, as clock speed and logic density increase, a number of ob- 
stacles become more significant [INTE04b]: 


e Power: As the density of logic and the clock speed on a chip increase, so does 
the power density (Watts/cm7). The difficulty of dissipating the heat generated 
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Figure 2.12 Intel Microprocessor Performance [GIBB04] 


on high-density, high-speed chips is becoming a serious design issue ([GIBB04], 
[BORK03)}). 


e RC delay: The speed at which electrons can flow on a chip between transis- 
tors is limited by the resistance and capacitance of the metal wires connecting 
them; specifically, delay increases as the RC product increases. As compo- 
nents on the chip decrease in size, the wire interconnects become thinner, in- 
creasing resistance. Also, the wires are closer together, increasing capacitance. 


e Memory latency: Memory speeds lag processor speeds, as previously discussed. 


Thus, there will be more emphasis on organization and architectural ap- 
proaches to improving performance. Figure 2.12 highlights the major changes that 
have been made over the years to increase the parallelism and therefore the 
computational efficiency of processors. These techniques are discussed in later 
chapters of the book. 

Beginning in the late 1980s, and continuing for about 15 years, two main strate- 
gies have been used to increase performance beyond what can be achieved simply 
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by increasing clock speed. First, there has been an increase in cache capacity. There 
are now typically two or three levels of cache between the processor and main mem- 
ory. As chip density has increased, more of the cache memory has been incorporated 
on the chip, enabling faster cache access. For example, the original Pentium chip de- 
voted about 10% of on-chip area to a cache. The most recent Pentium 4 chip devotes 
about half of the chip area to caches. 

Second, the instruction execution logic within a processor has become in- 
creasingly complex to enable parallel execution of instructions within the proces- 
sor. Two noteworthy design approaches have been pipelining and superscalar. A 
pipeline works much as an assembly line in a manufacturing plant enabling differ- 
ent stages of execution of different instructions to occur at the same time along the 
pipeline. A superscalar approach in essence allows multiple pipelines within a sin- 
gle processor so that instructions that do not depend on one another can be exe- 
cuted in parallel. 

Both of these approaches are reaching a point of diminishing returns. The in- 
ternal organization of contemporary processors is exceedingly complex and is able 
to squeeze a great deal of parallelism out of the instruction stream. It seems likely 
that further significant increases in this direction will be relatively modest 
[GIBB04]. With three levels of cache on the processor chip, each level providing 
substantial capacity, it also seems that the benefits from the cache are reaching 
a limit. 

However, simply relying on increasing clock rate for increased performance 
runs into the power dissipation problem already referred to. The faster the clock 
rate, the greater the amount of power to be dissipated, and some fundamental phys- 
ical limits are being reached. 

With all of these difficulties in mind, designers have turned to a fundamentally 
new approach to improving performance: placing multiple processors on the same 
chip, with a large shared cache. The use of multiple processors on the same chip, also 
referred to as multiple cores, or multicore, provides the potential to increase perfor- 
mance without increasing the clock rate. Studies indicate that, within a processor, the 
increase in performance is roughly proportional to the square root of the increase in 
complexity [BORKO03]. But if the software can support the effective use of multiple 
processors, then doubling the number of processors almost doubles performance. 
Thus, the strategy is to use two simpler processors on the chip rather than one more 
complex processor. 

In addition, with two processors, larger caches are justified. This is important 
because the power consumption of memory logic on a chip is much less than that of 
processing logic. In coming years, we can expect that most new processor chips will 
have multiple processors. 


2.3 THE EVOLUTION OF THE INTEL x86 ARCHITECTURE 


Throughout this book, we rely on many concrete examples of computer design and 
implementation to illustrate concepts and to illuminate trade-offs. Most of the time, 
the book relies on examples from two computer families: the Intel x86 and the 
ARM architecture. The current x86 offerings represent the results of decades of 
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design effort on complex instruction set computers (CISCs). The x86 incorporates 
the sophisticated design principles once found only on mainframes and supercom- 
puters and serves as an excellent example of CISC design. An alternative approach 
to processor design in the reduced instruction set computer (RISC). The ARM ar- 
chitecture is used in a wide variety of embedded systems and is one of the most 
powerful and best-designed RISC-based systems on the market. 

In this section and the next, we provide a brief overview of these two systems. 

In terms of market share, Intel has ranked as the number one maker of micro- 
processors for non-embedded systems for decades, a position it seems unlikely to 
yield. The evolution of its flagship microprocessor product serves as a good indica- 
tor of the evolution of computer technology in general. 

Table 2.6 shows that evolution. Interestingly, as microprocessors have grown 
faster and much more complex, Intel has actually picked up the pace. Intel used to 
develop microprocessors one after another, every four years. But Intel hopes to 
keep rivals at bay by trimming a year or two off this development time, and has 
done so with the most recent x86 generations. 

It is worthwhile to list some of the highlights of the evolution of the Intel prod- 
uct line: 


8080: The world’s first general-purpose microprocessor. This was an 8-bit ma- 
chine, with an 8-bit data path to memory. The 8080 was used in the first per- 
sonal computer, the Altair. 


8086: A far more powerful, 16-bit machine. In addition to a wider data path 
and larger registers, the 8086 sported an instruction cache, or queue, that 
prefetches a few instructions before they are executed. A variant of this 
processor, the 8088, was used in IBM’s first personal computer, securing the 
success of Intel. The 8086 is the first appearance of the x86 architecture. 


80286: This extension of the 8086 enabled addressing a 16-MByte memory in- 
stead of just 1 MByte. 


80386: Intel’s first 32-bit machine, and a major overhaul of the product. With a 
32-bit architecture, the 80386 rivaled the complexity and power of minicom- 
puters and mainframes introduced just a few years earlier. This was the first 
Intel processor to support multitasking, meaning it could run multiple pro- 
grams at the same time. 


80486: The 80486 introduced the use of much more sophisticated and powerful 
cache technology and sophisticated instruction pipelining. The 80486 also of- 
fered a built-in math coprocessor, offloading complex math operations from 
the main CPU. 


Pentium: With the Pentium, Intel introduced the use of superscalar tech- 
niques, which allow multiple instructions to execute in parallel. 


Pentium Pro: The Pentium Pro continued the move into superscalar organiza- 
tion begun with the Pentium, with aggressive use of register renaming, branch 
prediction, data flow analysis, and speculative execution. 


Pentium II: The Pentium II incorporated Intel MMX technology, which is de- 
signed specifically to process video, audio, and graphics data efficiently. 
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e Pentium II: The Pentium HI incorporates additional floating-point instruc- 
tions to support 3D graphics software. 


e Pentium 4: The Pentium 4 includes additional floating-point and other en- 
hancements for multimedia.® 


e Core: This is the first Intel x86 microprocessor with a dual core, referring to 
the implementation of two processors on a single chip. 


e Core 2: The Core 2 extends the architecture to 64 bits. The Core 2 Quad pro- 
vides four processors on a single chip. 


Over 30 years after its introduction in 1978, the x86 architecture continues to 
dominate the processor market outside of embedded systems. Although the organiza- 
tion and technology of the x86 machines has changed dramatically over the decades, 
the instruction set architecture has evolved to remain backward compatible with ear- 
lier versions. Thus, any program written on an older version of the x86 architecture can 
execute on newer versions. All changes to the instruction set architecture have involved 
additions to the instruction set, with no subtractions. The rate of change has been the 
addition of roughly one instruction per month added to the architecture over the 
30 years [ANTHO8], so that there are now over 500 instructions in the instruction set. 

The x86 provides an excellent illustration of the advances in computer hard- 
ware over the past 30 years. The 1978 8086 was introduced with a clock speed of 
5 MHz and had 29,000 transistors. A quad-core Intel Core 2 introduced in 2008 op- 
erates at 3 GHz, a speedup of a factor of 600, and has 820 million transistors, about 
28,000 times as many as the 8086. Yet the Core 2 is in only a slightly larger package 
than the 8086 and has a comparable cost. 


2.4 EMBEDDED SYSTEMS AND THE ARM 


The ARM architecture refers to a processor architecture that has evolved from 
RISC design principles and is used in embedded systems. Chapter 13 examines 
RISC design principles in detail. In this section, we give a brief overview of the con- 
cept of embedded systems, and then look at the evolution of the ARM. 


Embedded Systems 


The term embedded system refers to the use of electronics and software within a 
product, as opposed to a general-purpose computer, such as a laptop or desktop sys- 
tem. The following is a good general definition:? 


Embedded system. A combination of computer hardware and software, and perhaps 
additional mechanical or other parts, designed to perform a dedicated function. In many 
cases, embedded systems are part of a larger system or product, as in the case of an 
antilock braking system in a car. 


8With the Pentium 4, Intel switched from Roman numerals to Arabic numerals for model numbers. 


°Michael Barr, Embedded Systems Glossary. Netrino Technical Library. http://www.netrino.com/Publications/ 
Glossary/index.php 
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Table 2.7 Examples of Embedded Systems and Their Markets [NOER05] 


Market Embedded Device 


Ignition system 
Automotive Engine control 
Brake system 


Digital and analog televisions 

Set-top boxes (DVDs, VCRs, Cable boxes) 

Personal digital assistants (PDAs) 

Kitchen appliances (refrigerators, toasters, microwave ovens) 
Consumer electronics Automobiles 

Toys/games 

Telephones/cell phones/pagers 

Cameras 

Global positioning systems 


Robotics and controls systems for manufacturing 


Industrial control 
Sensors 


Infusion pumps 

Dialysis machines 
Prosthetic devices 
Cardiac monitors 


Medical 


Fax machine 
Photocopier 
Office automation Printers 
Monitors 
Scanners 


Embedded systems far outnumber general-purpose computer systems, encom- 


passing a broad range of applications (Table 2.7). These systems have widely varying 
requirements and constraints, such as the following [GRIM0S]: 


give 


Small to large systems, implying very different cost constraints, thus different 
needs for optimization and reuse 

Relaxed to very strict requirements and combinations of different quality re- 
quirements, for example, with respect to safety, reliability, real-time, flexibility, 
and legislation 

Short to long life times 

Different environmental conditions in terms of, for example, radiation, vibra- 
tions, and humidity 

Different application characteristics resulting in static versus dynamic loads, slow 


to fast speed, compute versus interface intensive tasks, and/or combinations 
thereof 


Different models of computation ranging from discrete-event systems to those 
involving continuous time dynamics (usually referred to as hybrid systems) 


Often, embedded systems are tightly coupled to their environment. This can 
rise to real-time constraints imposed by the need to interact with the envi- 


ronment. Constraints, such as required speeds of motion, required precision of 
measurement, and required time durations, dictate the timing of software operations. 
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Figure 2.13 Possible Organization of an Embedded 
System 


If multiple activities must be managed simultaneously, this imposes more complex 
real-time constraints. 

Figure 2.13, based on [KOOP96], shows in general terms an embedded system 
organization. In addition to the processor and memory, there are a number of ele- 
ments that differ from the typical desktop or laptop computer: 


e There may be a variety of interfaces that enable the system to measure, ma- 
nipulate, and otherwise interact with the external environment. 


e The human interface may be as simple as a flashing light or as complicated as 
real-time robotic vision. 


e The diagnostic port may be used for diagnosing the system that is being 
controlled—not just for diagnosing the computer. 


e Special-purpose field programmable (FPGA), application specific (ASIC), or 
even nondigital hardware may be used to increase performance or safety. 


e Software often has a fixed function and is specific to the application. 


ARM Evolution 


ARM is a family of RISC-based microprocessors and microcontrollers designed by 
ARM Inc., Cambridge, England. The company doesn’t make processors but instead 
designs microprocessor and multicore architectures and licenses them to manufac- 
turers. ARM chips are high-speed processors that are known for their small die size 
and low power requirements. They are widely used in PDAs and other handheld de- 
vices, including games and phones as well as a large variety of consumer products. 
ARM chips are the processors in Apple’s popular iPod and iPhone devices. ARM is 
probably the most widely used embedded processor architecture and indeed the 
most widely used processor architecture of any kind in the world. 

The origins of ARM technology can be traced back to the British-based Acorn 
Computers company. In the early 1980s, Acorn was awarded a contract by the 
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Table 2.8 ARM Evolution 


Typical MIPS 
Notable Features @ MHz 


32-bit RISC 


Multiply and swap instructions; Integrated 7 MIPS @ 12 MHz 
memory management unit, graphics and 
I/O processor 


ARM3 First use of processor cache 4 KB unified 12 MIPS @ 25 MHz 


ARM6 First to support 32-bit addresses; floating-point 4 KB unified 28 MIPS @ 33 MHz 
unit 


ARM7 Integrated SoC 8 KB unified 60 MIPS @ 60 MHz 
ARM8 5-stage pipeline; static branch prediction 8 KB unified 84 MIPS @ 72 MHz 
ARM9 16 KB/16 KB 300 MIPS @ 300 MHz 
ARM9E Enhanced DSP instructions 16 KB/16 KB 220 MIPS @ 200 MHz 
ARMIOE | 6-stage pipeline 32 KB/32 KB 
ARM11 9-stage pipeline Variable 740 MIPS @ 665 MHz 
Cortex 13-stage superscalar pipeline Variable 2000 MIPS @ 1 GHz 


XScale Applications processor; 7-stage pipeline 32 KB/32 KB L1 1000 MIPS @ 1.25 GHz 
512 KB L2 


DSP = digital signal processor 


SoC = system on a chip 


British Broadcasting Corporation (BBC) to develop a new microcomputer architec- 
ture for the BBC Computer Literacy Project. The success of this contract enabled 
Acorn to go on to develop the first commercial RISC processor, the Acorn RISC 
Machine (ARM). The first version, ARM1, became operational in 1985 and was 
used for internal research and development as well as being used as a coprocessor in 
the BBC machine. Also in 1985, Acorn released the ARM2, which had greater func- 
tionality and speed within the same physical space. Further improvements were 
achieved with the release in 1989 of the ARM3. 

Throughout this period, Acorn used the company VLSI Technology to do the 
actual fabrication of the processor chips. VLSI was licensed to market the chip on its 
own and had some success in getting other companies to use the ARM in their prod- 
ucts, particularly as an embedded processor. 

The ARM design matched a growing commercial need for a high-performance, 
low-power-consumption, small-size and low-cost processor for embedded applica- 
tions. But further development was beyond the scope of Acorns capabilities. 
Accordingly, a new company was organized, with Acorn, VLSI, and Apple Com- 
puter as founding partners, known as ARM Ltd. The Acorn RISC Machine became 
the Advanced RISC Machine.!° The new company’s first offering, an improvement 
on the ARM3, was designated ARM6. Subsequently, the company has introduced a 
number of new families, with increasing functionality and performance. Table 2.8 


0The company dropped the designation Advanced RISC Machine in the late 1990s. It is now simply 
known as the ARM architecture. 
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shows some characteristics of the various ARM architecture families. The numbers 
in this table are only approximate guides; actual values vary widely for different im- 
plementations. 

According to the ARM Web site arm.com, ARM processors are designed to 
meet the needs of three system categories: 


e Embedded real-time systems: Systems for storage, automotive body and 
power-train, industrial, and networking applications 


e Application platforms: Devices running open operating systems including 
Linux, Palm OS, Symbian OS, and Windows CE in wireless, consumer enter- 
tainment and digital imaging applications 


e Secure applications: Smart cards, SIM cards, and payment terminals 


2.5 PERFORMANCE ASSESSMENT 


In evaluating processor hardware and setting requirements for new systems, perfor- 
mance is one of the key parameters to consider, along with cost, size, security, relia- 
bility, and, in some cases power consumption. 

It is difficult to make meaningful performance comparisons among different 
processors, even among processors in the same family. Raw speed is far less impor- 
tant than how a processor performs when executing a given application. Unfortu- 
nately, application performance depends not just on the raw speed of the processor, 
but on the instruction set, choice of implementation language, efficiency of the com- 
piler, and skill of the programming done to implement the application. 

We begin this section with a look at some traditional measures of processor 
speed. Then we examine the most common approach to assessing processor and 
computer system performance. We follow this with a discussion of how to average 
results from multiple tests. Finally, we look at the insights produced by considering 
Amdahl’s law. 


Clock Speed and Instructions per Second 


THE SYSTEM CLOCK Operations performed by a processor, such as fetching an in- 
struction, decoding the instruction, performing an arithmetic operation, and so on, 
are governed by a system clock. Typically, all operations begin with the pulse of the 
clock. Thus, at the most fundamental level, the speed of a processor is dictated by the 
pulse frequency produced by the clock, measured in cycles per second, or Hertz (Hz). 

Typically, clock signals are generated by a quartz crystal, which generates a con- 
stant signal wave while power is applied. This wave is converted into a digital voltage 
pulse stream that is provided in a constant flow to the processor circuitry (Figure 
2.14). For example, a 1-GHz processor receives 1 billion pulses per second. The rate of 
pulses is known as the clock rate, or clock speed. One increment, or pulse, of the clock 
is referred to as a clock cycle, or a clock tick. The time between pulses is the cycle time. 

The clock rate is not arbitrary, but must be appropriate for the physical layout 
of the processor. Actions in the processor require signals to be sent from one 
processor element to another. When a signal is placed on a line inside the processor, 
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From Computer Desktop Encyclopedia, 
1998, The Computer Language Co. 


Figure 2.14 System Clock 


it takes some finite amount of time for the voltage levels to settle down so that an 
accurate value (1 or 0) is available. Furthermore, depending on the physical layout 
of the processor circuits, some signals may change more rapidly than others. Thus, 
operations must be synchronized and paced so that the proper electrical signal 
(voltage) values are available for each operation. 

The execution of an instruction involves a number of discrete steps, such as 
fetching the instruction from memory, decoding the various portions of the instruc- 
tion, loading and storing data, and performing arithmetic and logical operations. 
Thus, most instructions on most processors require multiple clock cycles to com- 
plete. Some instructions may take only a few cycles, while others require dozens. In 
addition, when pipelining is used, multiple instructions are being executed simulta- 
neously. Thus, a straight comparison of clock speeds on different processors does not 
tell the whole story about performance. 


INSTRUCTION EXECUTION RATE A processor is driven by a clock with a constant 
frequency f or, equivalently, a constant cycle time 7, where 7 = 1/f. Define the in- 
struction count, /., for a program as the number of machine instructions executed 
for that program until it runs to completion or for some defined time interval. Note 
that this is the number of instruction executions, not the number of instructions in 
the object code of the program. An important parameter is the average cycles per 
instruction CPI for a program. If all instructions required the same number of clock 
cycles, then CPJ would be a constant value for a processor. However, on any give 
processor, the number of clock cycles required varies for different types of instruc- 
tions, such as load, store, branch, and so on. Let CPI; be the number of cycles re- 
quired for instruction type i. and J; be the number of executed instructions of type i 
for a given program. Then we can calculate an overall CPI as follows: 
SCPE XD) 


CPI = 7 (2.1L) 


Cc 
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Table 2.9 Performance Factors and System Attributes 


Instruction set architecture 


Compiler technology 


Processor implementation 


Cache and memory hierarchy 


The processor time T needed to execute a given program can be expressed as 
T=I1,X CPI Xt 


We can refine this formulation by recognizing that during the execution of an 
instruction, part of the work is done by the processor, and part of the time a word is 
being transferred to or from memory. In this latter case, the time to transfer depends 
on the memory cycle time, which may be greater than the processor cycle time. We 
can rewrite the preceding equation as 


T=1,X[p+(mxXk)|] Xr 


where p is the number of processor cycles needed to decode and execute the instruc- 
tion, m is the number of memory references needed, and k is the ratio between mem- 
ory cycle time and processor cycle time. The five performance factors in the preceding 
equation (J,, p,m, k, 7) are influenced by four system attributes: the design of the in- 
struction set (known as instruction set architecture), compiler technology (how effec- 
tive the compiler is in producing an efficient machine language program from a 
high-level language program), processor implementation, and cache and memory hi- 
erarchy. Table 2.9, based on [HWAN93], is a matrix in which one dimension shows the 
five performance factors and the other dimension shows the four system attributes. 
An X in a cell indicates a system attribute that affects a performance factor. 

A common measure of performance for a processor is the rate at which in- 
structions are executed, expressed as millions of instructions per second (MIPS), re- 
ferred to as the MIPS rate. We can express the MIPS rate in terms of the clock rate 
and CPI as follows: 


a f 
MIPS rate = = (2.2) 


Tx 10° CPI x 10° 


For example, consider the execution of a program which results in the execu- 
tion of 2 million instructions on a 400-MHz processor. The program consists of four 
major types of instructions. The instruction mix and the CPI for each instruction 
type are given below based on the result of a program trace experiment: 


Instruction Type CPI Instruction Mix 
Arithmetic and logic 1 60% 
Load/store with cache hit 2 18% 
Branch 4 12% 
Memory reference with cache miss 8 10% 
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The average CPI when the program is executed on a uniprocessor with the 
above trace results is CPI = 0.6 + (2 X 0.18) + (4 X 0.12) + (8 X 0.1) = 2.24. 
The corresponding MIPS rate is (400 x 10°)/(2.24 x 10°) = 178. 

Another common performance measure deals only with floating-point in- 
structions. These are common in many scientific and game applications. Floating- 
point performance is expressed as millions of floating-point operations per second 
(MFLOPS), defined as follows: 


Number of executed floating-point operations in a program 
MELOPS rate = f HOWE POUEGE eee 


Execution time X 10° 


Benchmarks 


Measures such as MIPS and MFLOPS have proven inadequate to evaluating the 
performance of processors. Because of differences in instruction sets, the instruction 
execution rate is not a valid means of comparing the performance of different archi- 
tectures. For example, consider this high-level language statement: 


A=B+H+C /* assume all quantities in main memory */ 


With a traditional instruction set architecture, referred to as a complex instruction 
set computer (CISC), this instruction can be compiled into one processor instruction: 


add mem(B), mem(C), mem (A) 


On a typical RISC machine, the compilation would look something like this: 


load mem(B), reg(1); 
load mem(C), reg(2); 
add reg(1), reg(2), reg(3); 
store reg(3), mem (A) 


Because of the nature of the RISC architecture (discussed in Chapter 13), 
both machines may execute the original high-level language instruction in about the 
same time. If this example is representative of the two machines, then if the CISC 
machine is rated at 1 MIPS, the RISC machine would be rated at 4 MIPS. But both 
do the same amount of high-level language work in the same amount of time. 

Further, the performance of a given processor on a given program may not be 
useful in determining how that processor will perform on a very different type of ap- 
plication. Accordingly, beginning in the late 1980s and early 1990s, industry and aca- 
demic interest shifted to measuring the performance of systems using a set of 
benchmark programs. The same set of programs can be run on different machines 
and the execution times compared. 

[WEIC90] lists the following as desirable characteristics of a benchmark 
program: 


1. Itis written in a high-level language, making it portable across different machines. 


2. It is representative of a particular kind of programming style, such as systems 
programming, numerical programming, or commercial programming. 
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3. It can be measured easily. 


4. It has wide distribution. 


SPEC BENCHMARKS The common need in industry and academic and research 
communities for generally accepted computer performance measurements has led to 
the development of standardized benchmark suites. A benchmark suite is a collection 
of programs, defined in a high-level language, that together attempt to provide a rep- 
resentative test of a computer in a particular application or system programming area. 
The best known such collection of benchmark suites is defined and maintained by the 
System Performance Evaluation Corporation (SPEC), an industry consortium. SPEC 
performance measurements are widely used for comparison and research purposes. 

The best known of the SPEC benchmark suites is SPEC CPU2006. This is the in- 
dustry standard suite for processor-intensive applications. That is, SPEC CPU2006 is 
appropriate for measuring performance for applications that spend most of their time 
doing computation rather than I/O. The CPU2006 suite is based on existing applica- 
tions that have already been ported to a wide variety of platforms by SPEC industry 
members. It consists of 17 floating-point programs written in C, C++, and Fortran; 
and 12 integer programs written in C and C++.The suite contains over 3 million lines 
of code. This is the fifth generation of processor-intensive suites from SPEC, replacing 
SPEC CPU2000, SPEC CPU95, SPEC CPU92, and SPEC CPU89 [HENNO7]. 

Other SPEC suites include the following: 


e SPECjvm98: Intended to evaluate performance of the combined hardware 
and software aspects of the Java Virtual Machine (JVM) client platform 


SPECjbb2000 (Java Business Benchmark): A benchmark for evaluating 
server-side Java-based electronic commerce applications 


SPECweb99; Evaluates the performance of World Wide Web (WWW) servers 


e SPECmail2001: Designed to measure a system’s performance acting as a mail 
server 


AVERAGING RESULTS To obtain a reliable comparison of the performance of vari- 
ous computers, it is preferable to run a number of different benchmark programs on 
each machine and then average the results. For example, if m different benchmark 
program, then a simple arithmetic mean can be calculated as follows: 


1 m 
R,=—)>D RB; 2.3 
A m > i ( ) 
where R; is the high-level language instruction execution rate for the ith benchmark 


program. 
An alternative is to take the harmonic mean: 


(2.4) 


Ultimately, the user is concerned with the execution time of a system, not its 
execution rate. If we take arithmetic mean of the instruction rates of various bench- 
mark programs, we get a result that is proportional to the sum of the inverses of 
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execution times. But this is not inversely proportional to the sum of execution times. 
In other words, the arithmetic mean of the instruction rate does not cleanly relate to 
execution time. On the other hand, the harmonic mean instruction rate is the in- 
verse of the average execution time. 

SPEC benchmarks do not concern themselves with instruction execution 
rates. Rather, two fundamental metrics are of interest: a speed metric and a rate met- 
ric. The speed metric measures the ability of a computer to complete a single task. 
SPEC defines a base runtime for each benchmark program using a reference 
machine. Results for a system under test are reported as the ratio of the reference 
run time to the system run time. The ratio is calculated as follows: 


- Tref; 


= 2. 
Tsut; > 


Vj 
where Tref; is the execution time of benchmark program i on the reference system 
and Tsut; is the execution time of benchmark program i on the system under test. 

As an example of the calculation and reporting, consider the Sun Blade 6250, 
which consists of two chips with four cores, or processors, per chip. One of the SPEC 
CPU2006 integer benchmark is 464.h264ref. This is a reference implementation of 
H.264/AVC (Advanced Video Coding), the latest state-of-the-art video compres- 
sion standard. The Sun system executes this program in 934 seconds. The reference 
implementation requires 22,135 seconds. The ratio is calculated as: 22136/934 = 23.7. 

Because the time for the system under test is in the denominator, the larger 
the ratio, the higher the speed. An overall performance measure for the system 
under test is calculated by averaging the values for the ratios for all 12 integer 
benchmarks. SPEC specifies the use of a geometric mean, defined as follows: 


n 1/n 
'g = (11) (2.6) 


where 7; is the ratio for the ith benchmark program. For the Sun Blade 6250, the 
SPEC integer speed ratios were reported as follows: 


Benchmark Ratio Benchmark Ratio 
400.perlbench 17.5 458.sjeng 17.0 
401.bzip2 14.0 462. libquantum 31.3 
403.gcc 13.7 464.h264ref 23.7 
429.mcf 17.6 471.omnetpp 9.23 
445.gobmk 14.7 473.astar 10.9 
456.hmmer 18.6 483.xalancbmk 14.7 


The speed metric is calculated by taking the twelfth root of the product of the 
ratios: 


(17.5 X 14 x 13.7 X 17.6 X 14.7 X 18.6 X 17 X 31.3 X 23.7 X 9.23 x 10.9 x 14.7)? = 18.5 


The rate metric measures the throughput or rate of a machine carrying out a 
number of tasks. For the rate metrics, multiple copies of the benchmarks are run si- 
multaneously. Typically, the number of copies is the same as the number of proces- 
sors on the machine. Again, a ratio is used to report results, although the calculation 
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is more complex. The ratio is calculated as follows: 


_N X Tref; 


2.7 
Tsut; ) 


Yj 
where Tref; is the reference execution time for benchmark i, N is the number of 
copies of the program that are run simultaneously, and T'sut; is the elapsed time from 
the start of the execution of the program on all N processors of the system under 
test until the completion of all the copies of the program. Again, a geometric mean 
is calculated to determine the overall performance measure. 

SPEC chose to use a geometric mean because it is the most appropriate for 
normalized numbers, such as ratios. [FLEM86] demonstrates that the geometric 
mean has the property of performance relationships consistently maintained re- 
gardless of the computer that is used as the basis for normalization. 


Amdahl’s Law 


When considering system performance, computer system designers look for ways to 
improve performance by improvement in technology or change in design. Examples 
include the use of parallel processors, the use of a memory cache hierarchy, and 
speedup in memory access time and I/O transfer rate due to technology improve- 
ments. In all of these cases, it is important to note that a speedup in one aspect of the 
technology or design does not result in a corresponding improvement in perfor- 
mance. This limitation is succinctly expressed by Amdahl’s law. 

Amdahl’s law was first proposed by Gene Amdahl in [AMDA67] and deals 
with the potential speedup of a program using multiple processors compared to a 
single processor. Consider a program running on a single processor such that a frac- 
tion (1 — f) of the execution time involves code that is inherently serial and a frac- 
tion f that involves code that is infinitely parallelizable with no scheduling overhead. 
Let T be the total execution time of the program using a single processor. Then the 
speedup using a parallel processor with N processors that fully exploits the parallel 
portion of the program is as follows: 


time to execute program on a single processor 


d — 
Speedup time to execute program on N parallel processors 
ae Beha 1 
~ Tf ee 
T( : 1 at 
Cane Steps 


Two important conclusions can be drawn: 


1. When f is small, the use of parallel processors has little effect. 


2. As N approaches infinity, speedup is bound by 1/(1 — f), so that there are 
diminishing returns for using more processors. 


These conclusions are too pessimistic, an assertion first put forward in 
[GUST88]. For example, a server can maintain multiple threads or multiple tasks to 
handle multiple clients and execute the threads or tasks in parallel up to the limit of 
the number of processors. Many database applications involve computations on 
massive amounts of data that can be split up into multiple parallel tasks. Nevertheless, 
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Amdahl’s law illustrates the problems facing industry in the development of multi- 
core machines with an ever-growing number of cores: The software that runs on 
such machines must be adapted to a highly parallel execution environment to ex- 
ploit the power of parallel processing. 

Amdahl’s law can be generalized to evaluate any design or technical improve- 
ment in a computer system. Consider any enhancement to a feature of a system that 
results in a speedup. The speedup can be expressed as 


Performance after enhancement Execution time before enhancement 


Speedup = = ; ; 
P P Performance before enhancement Execution time after enhancement 


(2.8) 

Suppose that a feature of the system is used during execution a fraction of the 

time f, before enhancement, and that the speedup of that feature after enhancement 
is SU;. Then the overall speedup of the system is 


1 


Speedup = f 
1-f)+— 
(lf) SU; 
For example, suppose that a task makes extensive use of floating-point operations, 
with 40% of the time is consumed by floating-point operations. With a new hard- 
ware design, the floating-point module is speeded up by a factor of K. Then the 


overall speedup is: 


1 


0.4 
0.6 + = 
K 


Thus, independent of K, the maximum speedup is 1.67. 


2.6 RECOMMENDED READING AND WEB SITES 


A description of the IBM 7000 series can be found in [BELL71]. There is good coverage of the 
IBM 360 in [SIEW82] and of the PDP-8 and other DEC machines in [BELL78a]. These three 
books also contain numerous detailed examples of other computers spanning the history of 
computers through the early 1980s. A more recent book that includes an excellent set of case 
studies of historical machines is [BLAA97].A good history of the microprocessor is [BETK97]. 

[OLUK96], [HAMM97], and [SAKA02] discuss the motivation for multiple processors 
on a single chip. 

[BREY09] provides a good survey of the Intel microprocessor line. The Intel docu- 
mentation itself is also good [INTE08]. 

The most thorough documentation available for the ARM architecture is [SEALO0]."" 
[FURB00] is another excellent source of information. [SMITO08] is an interesting comparison 
of the ARM and x86 approaches to embedding processors in mobile wireless devices. 

For interesting discussions of Moore’s law and its consequences, see [HUTC96], 
[SCHA97], and [BOHR98]. 

[HENNO6] provides a detailed description of each of the benchmarks in CPU2006. 
[SMIT88] discusses the relative merits of arithmetic, harmonic, and geometric means. 


Speedup = 


Known in the ARM community as the “ARM ARM.” 
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Recommended Web sites: 
e Intel Developer’s Page: Intel’s Web page for developers; provides a starting point for 
accessing Pentium information. Also includes the Intel Technology Journal. 


e ARM: Home page of ARM Limited, developer of the ARM architecture. Includes 
technical documentation. 
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e Standard Performance Evaluation Corporation: SPEC is a widely recognized or- 
ganization in the computer industry for its development of standardized benchmarks 
used to measure and compare performance of different computer systems. 

e Top500 Supercomputer Site: Provides brief description of architecture and organi- 
zation of current supercomputer products, plus comparisons. 

e Charles Babbage Institute: Provides links to a number of Web sites dealing with the 
history of computers. 
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Key Terms 


accumulator (AC) instruction cycle opcode 

Amdahl’s law instruction register (IR) original equipment manufac- 
arithmetic and logic unit (ALU) | instruction set turer (OEM) 

benchmark integrated circuit (IC) program control unit 

chip main memory program counter (PC) 

data channel memory address register SPEC 


embedded system (MAR) stored program computer 
execute cycle memory buffer register (MBR) | upward compatible 

fetch cycle microprocessor von Neumann machine 
input-output (I/O) multicore wafer 

instruction buffer register (IBR) | multiplexor word 


Review Questions 


2.1. What is a stored program computer? 

2.2. What are the four main components of any general-purpose computer? 

2.3. At the integrated circuit level, what are the three principal constituents of a computer 
system? 

2.4. Explain Moore’s law. 

2.5. List and explain the key characteristics of a computer family. 

2.6. What is the key distinguishing feature of a microprocessor? 


Problems 


2.1. Let A = A(1),A(2),...,A(1000) and B = B(1), B(2),.. . , B(1000) be two vectors 
(one-dimensional arrays) comprising 1000 numbers each that are to be added to form 
an array C such that C(I) = A(J) + B()) for I = 1,2,..., 1000. Using the IAS in- 
struction set, write a program for this problem. Ignore the fact that the IAS was de- 
signed to have only 1000 words of storage. 

2.2. a. On the IAS, what would the machine code instruction look like to load the con- 

tents of memory address 2? 
b. How many trips to memory does the CPU need to make to complete this instruc- 
tion during the instruction cycle? 

2.3. On the IAS, describe in English the process that the CPU must undertake to read a 
value from memory and to write a value to memory in terms of what is put into the 
MAR, MBR, address bus, data bus, and control bus. 
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2.4. 


2.8. 


2.9. 


2.10. 


Given the memory contents of the [AS computer shown below, 


Address Contents 

08A 010FA210FB 
08B O10FAOFO8D 
08C 020FA210FB 


show the assembly language code for the program, starting at address 08A. Explain 
what this program does. 

In Figure 2.3, indicate the width, in bits, of each data path (e.g., between AC and ALU). 
In the IBM 360 Models 65 and 75, addresses are staggered in two separate main mem- 
ory units (e.g., all even-numbered words in one unit and all odd-numbered words in 
another). What might be the purpose of this technique? 

With reference to Table 2.4, we see that the relative performance of the IBM 360 
Model 75 is 50 times that of the 360 Model 30, yet the instruction cycle time is only 5 
times as fast. How do you account for this discrepancy? 

While browsing at Billy Bob’s computer store, you overhear a customer asking Billy 
Bob what is the fastest computer in the store that he can buy. Billy Bob replies, “You’re 
looking at our Macintoshes. The fastest Mac we have runs at a clock speed of 1.2 giga- 
hertz. If you really want the fastest machine, you should buy our 2.4-gigahertz Intel 
Pentium IV instead.” Is Billy Bob correct? What would you say to help this customer? 
The ENIAC was a decimal machine, where a register was represented by a ring of 10 
vacuum tubes. At any time, only one vacuum tube was in the ON state, representing 
one of the 10 digits. Assuming that ENIAC had the capability to have multiple vacuum 
tubes in the ON and OFF state simultaneously, why is this representation “wasteful” 
and what range of integer values could we represent using the 10 vacuum tubes? 

A benchmark program is run on a 40 MHz processor. The executed program consists of 
100,000 instruction executions, with the following instruction mix and clock cycle count: 


Instruction Type Instruction Count Cycles per Instruction 
Integer arithmetic 45000 1 
Data transfer 32000 2 
Floating point 15000 2 
Control transfer 8000 2 


Determine the effective CPI, MIPS rate, and execution time for this program. 
Consider two different machines, with two different instruction sets, both of which 
have a clock rate of 200 MHz. The following measurements are recorded on the two 
machines running a given set of benchmark programs: 


Instruction Count 
Instruction Type (millions) Cycles per Instruction 


Machine A 


Arithmetic and logic 8 1 

Load and store 4 3 

Branch 2 4 

Others 4 3 
Machine A 

Arithmetic and logic 1 


0 
Load and store 8 
Branch 2 
Others 4 
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a. Determine the effective CPI, MIPS rate, and execution time for each machine. 
b. Comment on the results. 


Early examples of CISC and RISC design are the VAX 11/780 and the IBM RS/6000, 
respectively. Using a typical benchmark program, the following machine characteris- 
tics result: 


Processor Clock Frequency | Performance CPU Time 
VAX 11/780 5 MHz 1 MIPS 12 x seconds 
IBM RS/6000 25 MHz 18 MIPS x seconds 


The final column shows that the VAX required 12 times longer than the IBM mea- 
sured in CPU time. 


a. What is the relative size of the instruction count of the machine code for this 
benchmark program running on the two machines? 


b. What are the CPI values for the two machines? 
Four benchmark programs are executed on three computers with the following results: 


Computer A | Computer B | Computer C 
Program 1 1 10 20 
Program 2 1000 100 20 
Program 3 500 1000 50 
Program 4 100 800 100 


The table shows the execution time in seconds, with 100,000,000 instructions executed in 
each of the four programs. Calculate the MIPS values for each computer for each pro- 
gram. Then calculate the arithmetic and harmonic means assuming equal weights for the 
four programs, and rank the computers based on arithmetic mean and harmonic mean. 
The following table, based on data reported in the literature [HEAT84], shows the ex- 
ecution times, in seconds, for five different benchmark programs on three machines. 


Processor 


Benchmark 


a. Compute the speed metric for each processor for each benchmark, normalized to 
machine R. That is, the ratio values for R are all 1.0. Other ratios are calculated 
using Equation (2.5) with R treated as the reference system. Then compute the 
arithmetic mean value for each system using Equation (2.3). This is the approach 
taken in [HEATS84]. 

b. Repeat part (a) using M as the reference machine. This calculation was not tried in 
[HEATS84]. 


c. Which machine is the slowest based on each of the preceding two calculations? 


d. Repeat the calculations of parts (a) and (b) using the geometric mean, defined in 
Equation (2.6). Which machine is the slowest based on the two calculations? 
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2.15. 


2.16. 


2.17. 


To clarify the results of the preceding problem, we look at a simpler example. 


Renchniark Processor 
x Y Z 
1 20 10 40 
2 40 80 20 


a. Compute the arithmetic mean value for each system using X as the reference ma- 
chine and then using Y as the reference machine. Argue that intuitively the three 
machines have roughly equivalent performance and that the arithmetic mean 
gives misleading results. 

b. Compute the geometric mean value for each system using X as the reference ma- 
chine and then using Y as the reference machine. Argue that the results are more 
realistic than with the arithmetic mean. 


Consider the example in Section 2.5 for the calculation of average CPI and MIPS 
rate, which yielded the result of CPI = 2.24 and MIPS rate = 178. Now assume that the 
program can be executed in eight parallel tasks or threads with roughly equal number 
of instructions executed in each task. Execution is on an 8-core system with each core 
(processor) having the same performance as the single processor originally used. 
Coordination and synchronization between the parts adds an extra 25,000 instruction 
executions to each task. Assume the same instruction mix as in the example for 
each task, but increase the CPI for memory reference with cache miss to 12 cycles 
due to contention for memory. 


a. Determine the average CPI. 
b. Determine the corresponding MIPS rate. 
c. Calculate the speedup factor. 


d. Compare the actual speedup factor with the theoretical speedup factor deter- 
mined by Amdhal’s law. 

A processor accesses main memory with an average access time of T>. A smaller 
cache memory is interposed between the processor and main memory. The cache has 
a significantly faster access time of T; < 7 . The cache holds, at any time, copies of 
some main memory words and is designed so that the words more likely to be ac- 
cessed in the near future are in the cache. Assume that the probability that the next 
word accessed by the processor is in the cache is H, known as the hit ratio. 


a. For any single memory access, what is the theoretical speedup of accessing the 
word in the cache rather than in main memory? 

b. Let T be the average access time. Express T as a function of T;, 72, and H.What is 
the overall speedup as a function of H? 

In practice, a system may be designed so that the processor must first access the 
cache to determine if the word is in the cache and, if it is not, then access main 
memory, so that on a miss (opposite of a hit), memory access time is 7, + T. Ex- 
press T as a function of T,, 7, and H. Now calculate the speedup and compare to 
the result produced in part (b). 
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PART Two 


The Computer System 


P.1 ISSUES FOR PART TWO 


A computer system consists of a processor, memory, I/O, and the interconnections 
among these major components. With the exception of the processor, which is suffi- 
ciently complex to devote Part Three to its study, Part Two examines each of these 
components in detail. 


ROAD MAP FOR PART TWO 


Chapter 3 A Top-Level View of Computer Function 
and Interconnection 


At a top level, a computer consists of a processor, memory, and I/O compo- 
nents. The functional behavior of the system consists of the exchange of data 
and control signals among these components. To support this exchange, these 
components must be interconnected. Chapter 3 begins with a brief examina- 
tion of the computer’s components and their input-output requirements. The 
chapter then looks at key issues that affect interconnection design, especially 
the need to support interrupts. The bulk of the chapter is devoted to a study of 
the most common approach to interconnection: the use of a structure of buses. 


Chapter 4 Cache Memory 


Computer memory exhibits a wide range of type, technology, organiza- 
tion, performance, and cost. The typical computer system is equipped with 
a hierarchy of memory subsystems, some internal (directly accessible by 
the processor) and some external (accessible by the processor via an I/O 
module). Chapter 4 begins with an overview of this hierarchy. Next, the 
chapter deals in detail with the design of cache memory, including sepa- 
rate code and data caches and two-level caches. 
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Chapter 5 Internal Memory 


The design of a main memory system is a never-ending battle among 
three competing design requirements: large storage capacity, rapid access 
time, and low cost. As memory technology evolves, each of these three 
characteristics is changing, so that the design decisions in organizing main 
memory must be revisited anew with each new implementation. Chapter 
5 focuses on design issues related to internal memory. First, the nature 
and organization of semiconductor main memory is examined. Then, 
recent advanced DRAM memory organizations are explored. 


Chapter 6 External Memory 


For truly large storage capacity and for more permanent storage than is 
available with main memory, an external memory organization is needed. 
The most widely used type of external memory is magnetic disk, and 
much of Chapter 6 concentrates on this topic. First, we look at magnetic 
disk technology and design considerations. Then, we look at the use of 
RAID organization to improve disk memory performance. Chapter 6 also 
examines optical and tape storage. 


Chapter 7 Input/Output 


I/O modules are interconnected with the processor and main memory, and 
each controls one or more external devices. Chapter 7 is devoted to the var- 
ious aspects of I/O organization. This is a complex area, and less well under- 
stood than other areas of computer system design in terms of meeting 
performance demands. Chapter 7 examines the mechanisms by which an 
I/O module interacts with the rest of the computer system, using the tech- 
niques of programmed I/O, interrupt I/O, and direct memory access (DMA). 
The interface between an I/O module and external devices is also described. 


Chapter 8 Operating System Support 


A detailed examination of operating systems (OSs) is beyond the scope 
of this book. However, it is important to understand the basic functions of 
an operating system and how the OS exploits hardware to provide the de- 
sired performance. Chapter 8 describes the basic principles of operating 
systems and discusses the specific design features in the computer hard- 
ware intended to provide support for the operating system. The chapter 
begins with a brief history, which serves to identify the major types of op- 
erating systems and to motivate their use. Next, multiprogramming is ex- 
plained by examining the long-term and short-term scheduling functions. 
Finally, an examination of memory management includes a discussion of 
segmentation, paging, and virtual memory. 
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KEY POINTS 


® An instruction cycle consists of an instruction fetch, followed by zero or 
more operand fetches, followed by zero or more operand stores, followed 
by an interrupt check (if interrupts are enabled). 


® The major computer system components (processor, main memory, I/O 
modules) need to be interconnected in order to exchange data and control 
signals. The most popular means of interconnection is the use of a shared 
system bus consisting of multiple lines. In contemporary systems, there typ- 
ically is a hierarchy of buses to improve performance. 


® Key design elements for buses include arbitration (whether permission to 
send signals on bus lines is controlled centrally or in a distributed fashion); 
timing (whether signals on the bus are synchronized to a central clock or 
are sent asynchronously based on the most recent transmission); and width 
(number of address lines and number of data lines). 


At a top level, a computer consists of CPU (central processing unit), memory, and I/O 
components, with one or more modules of each type. These components are intercon- 
nected in some fashion to achieve the basic function of the computer, which is to exe- 
cute programs. Thus, at a top level, we can describe a computer system by (1) describing 
the external behavior of each component—that is, the data and control signals that it 
exchanges with other components; and (2) describing the interconnection structure 
and the controls required to manage the use of the interconnection structure. 

This top-level view of structure and function is important because of its explana- 
tory power in understanding the nature of a computer. Equally important is its use to 
understand the increasingly complex issues of performance evaluation. A grasp of the 
top-level structure and function offers insight into system bottlenecks, alternate path- 
ways, the magnitude of system failures if a component fails, and the ease of adding per- 
formance enhancements. In many cases, requirements for greater system power and 
fail-safe capabilities are being met by changing the design rather than merely increas- 
ing the speed and reliability of individual components. 

This chapter focuses on the basic structures used for computer component in- 
terconnection. As background, the chapter begins with a brief examination of the 
basic components and their interface requirements. Then a functional overview is 
provided. We are then prepared to examine the use of buses to interconnect system 
components. 


3.1 COMPUTER COMPONENTS 


As discussed in Chapter 2, virtually all contemporary computer designs are based 
on concepts developed by John von Neumann at the Institute for Advanced Studies, 
Princeton. Such a design is referred to as the von Neumann architecture and is based 
on three key concepts: 
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e Data and instructions are stored in a single read—write memory. 


e The contents of this memory are addressable by location, without regard to 
the type of data contained there. 


e Execution occurs in a sequential fashion (unless explicitly modified) from one 
instruction to the next. 


The reasoning behind these concepts was discussed in Chapter 2 but is worth 
summarizing here. There is a small set of basic logic components that can be com- 
bined in various ways to store binary data and to perform arithmetic and logical op- 
erations on that data. If there is a particular computation to be performed, a 
configuration of logic components designed specifically for that computation could 
be constructed. We can think of the process of connecting the various components in 
the desired configuration as a form of programming. The resulting “program” is in 
the form of hardware and is termed a hardwired program. 

Now consider this alternative. Suppose we construct a general-purpose config- 
uration of arithmetic and logic functions. This set of hardware will perform various 
functions on data depending on control signals applied to the hardware. In the orig- 
inal case of customized hardware, the system accepts data and produces results 
(Figure 3.1a). With general-purpose hardware, the system accepts data and control 
signals and produces results. Thus, instead of rewiring the hardware for each new 
program, the programmer merely needs to supply a new set of control signals. 

How shall control signals be supplied? The answer is simple but subtle. The en- 
tire program is actually a sequence of steps. At each step, some arithmetic or logical 


Sequence of 


Data arithmetic 
and logic 
functions 


Results 


(a) Programming in hardware 


Instruction Instruction 
codes interpreter 


Control 
signals 


General-purpose 


arithmetic 
and logic 
functions 


Data Results 


(b) Programming in software 


Figure 3.1 Hardware and Software Approaches 
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operation is performed on some data. For each step, a new set of control signals is 
needed. Let us provide a unique code for each possible set of control signals, and let 
us add to the general-purpose hardware a segment that can accept a code and gen- 
erate control signals (Figure 3.1b). 

Programming is now much easier. Instead of rewiring the hardware for each 
new program, all we need to do is provide a new sequence of codes. Each code is, in 
effect, an instruction, and part of the hardware interprets each instruction and gen- 
erates control signals. To distinguish this new method of programming, a sequence 
of codes or instructions is called software. 

Figure 3.1b indicates two major components of the system: an instruction in- 
terpreter and a module of general-purpose arithmetic and logic functions. These two 
constitute the CPU. Several other components are needed to yield a functioning 
computer. Data and instructions must be put into the system. For this we need some 
sort of input module. This module contains basic components for accepting data and 
instructions in some form and converting them into an internal form of signals us- 
able by the system. A means of reporting results is needed, and this is in the form of 
an output module. Taken together, these are referred to as I/O components. 

One more component is needed. An input device will bring instructions and 
data in sequentially. But a program is not invariably executed sequentially; it may 
jump around (e.g., the [AS jump instruction). Similarly, operations on data may re- 
quire access to more than just one element at a time in a predetermined sequence. 
Thus, there must be a place to store temporarily both instructions and data. That 
module is called memory, or main memory to distinguish it from external storage or 
peripheral devices. Von Neumann pointed out that the same memory could be used 
to store both instructions and data. 

Figure 3.2 illustrates these top-level components and suggests the interactions 
among them. The CPU exchanges data with memory. For this purpose, it typically 
makes use of two internal (to the CPU) registers: a memory address register 
(MAR), which specifies the address in memory for the next read or write, and a 
memory buffer register (MBR), which contains the data to be written into memory 
or receives the data read from memory. Similarly, an I/O address register (I/OAR) 
specifies a particular I/O device. An I/O buffer (I/OBR) register is used for the ex- 
change of data between an I/O module and the CPU. 

A memory module consists of a set of locations, defined by sequentially num- 
bered addresses. Each location contains a binary number that can be interpreted as 
either an instruction or data. An I/O module transfers data from external devices to 
CPU and memory, and vice versa. It contains internal buffers for temporarily hold- 
ing these data until they can be sent on. 

Having looked briefly at these major components, we now turn to an overview 
of how these components function together to execute programs. 


3.2; COMPUTER FUNCTION 


The basic function performed by a computer is execution of a program, which con- 
sists of a set of instructions stored in memory. The processor does the actual work by 
executing instructions specified in the program. This section provides an overview of 


3.2 / COMPUTER FUNCTION 69 


CPU Main memory 
0 
System : 1 
7 2 
PC MAR pus : 
Instruction ° 
Instruction . 
saa == Instruction 
IR MBR 
1/0 AR . 
VO BR a 
Data 
Data 
I/O Module : sale 
n-1 
: PC = Program counter 
Buffers IR = Instruction register 
MAR = Memory address register 
MBR = Memory buffer register 
VYOAR- = __Input/output address register 
VYOBR = __Input/output buffer register 


Figure 3.2 Computer Components:Top-Level View 


the key elements of program execution. In its simplest form, instruction processing 
consists of two steps: The processor reads ( fetches) instructions from memory one at 
a time and executes each instruction. Program execution consists of repeating the 
process of instruction fetch and instruction execution. The instruction execution 
may involve several operations and depends on the nature of the instruction (see, 
for example, the lower portion of Figure 2.4). 

The processing required for a single instruction is called an instruction cycle. 
Using the simplified two-step description given previously, the instruction cycle is de- 
picted in Figure 3.3. The two steps are referred to as the fetch cycle and the execute 
cycle. Program execution halts only if the machine is turned off, some sort of unrecov- 
erable error occurs, or a program instruction that halts the computer is encountered. 


Instruction Fetch and Execute 


At the beginning of each instruction cycle, the processor fetches an instruction from 
memory. In a typical processor, a register called the program counter (PC) holds the 
address of the instruction to be fetched next. Unless told otherwise, the processor 
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Figure 3.3. Basic Instruction Cycle 


always increments the PC after each instruction fetch so that it will fetch the next in- 
struction in sequence (i.e., the instruction located at the next higher memory ad- 
dress). So, for example, consider a computer in which each instruction occupies one 
16-bit word of memory. Assume that the program counter is set to location 300. The 
processor will next fetch the instruction at location 300. On succeeding instruction 
cycles, it will fetch instructions from locations 301, 302, 303, and so on. This sequence 
may be altered, as explained presently. 

The fetched instruction is loaded into a register in the processor known as the 
instruction register (IR). The instruction contains bits that specify the action the 
processor is to take. The processor interprets the instruction and performs the re- 
quired action. In general, these actions fall into four categories: 


e Processor-memory: Data may be transferred from processor to memory or 
from memory to processor. 


e Processor-I/O: Data may be transferred to or from a peripheral device by 
transferring between the processor and an I/O module. 


e Data processing: The processor may perform some arithmetic or logic opera- 
tion on data. 


e Control: An instruction may specify that the sequence of execution be altered. 
For example, the processor may fetch an instruction from location 149, which 
specifies that the next instruction be from location 182. The processor will re- 
member this fact by setting the program counter to 182. Thus, on the next fetch 
cycle, the instruction will be fetched from location 182 rather than 150. 


An instruction’s execution may involve a combination of these actions. 

Consider a simple example using a hypothetical machine that includes the 
characteristics listed in Figure 3.4. The processor contains a single data register, 
called an accumulator (AC). Both instructions and data are 16 bits long. Thus, it is 
convenient to organize memory using 16-bit words. The instruction format provides 
4 bits for the opcode, so that there can be as many as 2* = 16 different opcodes, and 
up to 2!” = 4096 (4K) words of memory can be directly addressed. 

Figure 3.5 illustrates a partial program execution, showing the relevant por- 
tions of memory and processor registers.! The program fragment shown adds the 
contents of the memory word at address 940 to the contents of the memory word at 


‘Hexadecimal notation is used, in which each digit represents 4 bits. This is the most convenient notation 
for representing the contents of memory and registers when the word length is a multiple of 4. See Chap- 
ter 19 for a basic refresher on number systems (decimal, binary, hexadecimal). 
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Program counter (PC) = Address of instruction 
Instruction register (IR) = Instruction being executed 
Accumulator (AC) = Temporary storage 


(c) Internal CPU registers 
0001 = Load AC from memory 
0010 = Store AC to memory 
0101 = Add to AC from memory 


(d) Partial list of opcodes 
Figure 3.4 Characteristics of a Hypothetical Machine 


Memory CPU registers Memory CPU registers 
300}1 9 4 0 Sm] PC 300/1 9 4 0 Bm) PC 
sor [sos] | | AC} 301|5 9 4 1 0 0 0 3/AC 
302|2 9 4 1 1 9 4 O/ IR | 302/2 9 4 1 19 4 O;IR 
940/0 0 0 3 9401/0 0 0 3 
941/0 0 0 2 941|}0 0 0 2 
Step | Step 2 

Memory CPU registers Memory CPU registers 
300}1 9 4 0 Smee) PC 300/1 9 4 0 3 0 2]PC 
~~ fe ie 0 0 0 5} AC 
302|2 9 4 1 5.9 4 L/IR i 
940/0 0 0 3 9401/0 0 0 3 342=5 
o41[0 0 0 2 941 [oe}—_—__7 
Step 3 Step 4 

Memory CPU registers Memory CPU registers 
300}1 9 4 0 Sm) PC 300/1 9 4 0 3 0 3)Bye 
301}5 9 4 1 0.0 0 5}AC} 301|5 9 4 1 0 0 0 5})AC 
302|2 9 4 1}—>|2 9 4 1/IR | 302/2 9 4 1 2 @ 4 il ian 
940|0 0 0 3 940|0 0 0 3 
941/0 0 0 2 94110 0 0 5 
Step 5 Step 6 


Figure 3.5 Example of Program Execution (contents of memory and 
registers in hexadecimal) 
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address 941 and stores the result in the latter location. Three instructions, which can 
be described as three fetch and three execute cycles, are required: 


1. The PC contains 300, the address of the first instruction. This instruction (the 
value 1940 in hexadecimal) is loaded into the instruction register IR and the 
PC is incremented. Note that this process involves the use of a memory ad- 
dress register (MAR) and a memory buffer register (MBR). For simplicity, 
these intermediate registers are ignored. 

. The first 4 bits (first hexadecimal digit) in the IR indicate that the AC is to be 
loaded. The remaining 12 bits (three hexadecimal digits) specify the address 
(940) from which data are to be loaded. 


3. The next instruction (5941) is fetched from location 301 and the PC is 
incremented. 


4. The old contents of the AC and the contents of location 941 are added and the 
result is stored in the AC. 


. The next instruction (2941) is fetched from location 302 and the PC is 
incremented. 


6. The contents of the AC are stored in location 941. 


nN 


wn 


In this example, three instruction cycles, each consisting of a fetch cycle and an 
execute cycle, are needed to add the contents of location 940 to the contents of 941. 
With a more complex set of instructions, fewer cycles would be needed. Some older 
processors, for example, included instructions that contain more than one memory 
address. Thus the execution cycle for a particular instruction on such processors 
could involve more than one reference to memory. Also, instead of memory refer- 
ences, an instruction may specify an I/O operation. 

For example, the PDP-11 processor includes an instruction, expressed symbol- 
ically as ADD B,A, that stores the sum of the contents of memory locations B and A 
into memory location A. A single instruction cycle with the following steps occurs: 


e Fetch the ADD instruction. 
e Read the contents of memory location A into the processor. 


e Read the contents of memory location B into the processor. In order that the 
contents of A are not lost, the processor must have at least two registers for 
storing memory values, rather than a single accumulator. 


e Add the two values. 


e Write the result from the processor to memory location A. 


Thus, the execution cycle for a particular instruction may involve more than 
one reference to memory. Also, instead of memory references, an instruction may 
specify an I/O operation. With these additional considerations in mind, Figure 3.6 
provides a more detailed look at the basic instruction cycle of Figure 3.3. The figure is 
in the form of a state diagram. For any given instruction cycle, some states may be 
null and others may be visited more than once. The states can be described as follows: 


e Instruction address calculation (iac): Determine the address of the next in- 
struction to be executed. Usually, this involves adding a fixed number to the 
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Figure 3.6 Instruction Cycle State Diagram 


address of the previous instruction. For example, if each instruction is 16 bits 
long and memory is organized into 16-bit words, then add 1 to the previous ad- 
dress. If, instead, memory is organized as individually addressable 8-bit bytes, 
then add 2 to the previous address. 


e Instruction fetch (if): Read instruction from its memory location into the 
processor. 


e Instruction operation decoding (iod): Analyze instruction to determine type 
of operation to be performed and operand(s) to be used. 


e Operand address calculation (oac): If the operation involves reference to an 
operand in memory or available via I/O, then determine the address of the 
operand. 


e Operand fetch (of): Fetch the operand from memory or read it in from I/O. 
e Data operation (do): Perform the operation indicated in the instruction. 


e Operand store (os): Write the result into memory or out to I/O. 


States in the upper part of Figure 3.6 involve an exchange between the processor 
and either memory or an I/O module. States in the lower part of the diagram involve 
only internal processor operations. The oac state appears twice, because an instruction 
may involve a read, a write, or both. However, the action performed during that state 
is fundamentally the same in both cases, and so only a single state identifier is needed. 

Also note that the diagram allows for multiple operands and multiple results, 
because some instructions on some machines require this. For example, the PDP-11 
instruction ADD A,B results in the following sequence of states: iac, if, iod, oac, of, 
oac, of, do, oac, os. 

Finally, on some machines, a single instruction can specify an operation to be per- 
formed on a vector (one-dimensional array) of numbers or a string (one-dimensional 
array) of characters. As Figure 3.6 indicates, this would involve repetitive operand 
fetch and/or store operations. 
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Table 3.1 Classes of Interrupts 


Generated by some condition that occurs as a result of an instruction 
execution, such as arithmetic overflow, division by zero, attempt to 
execute an illegal machine instruction, or reference outside a user’s 
allowed memory space. 


Generated by a timer within the processor. This allows the operating 
system to perform certain functions on a regular basis. 


vo Generated by an I/O controller, to signal normal completion of an 
operation or to signal a variety of error conditions. 


Hardware failure Generated by a failure such as power failure or memory parity error. 


Interrupts 


Virtually all computers provide a mechanism by which other modules (I/O, mem- 
ory) may interrupt the normal processing of the processor. Table 3.1 lists the most 
common classes of interrupts. The specific nature of these interrupts is examined 
later in this book, especially in Chapters 7 and 12. However, we need to introduce 
the concept now to understand more clearly the nature of the instruction cycle and 
the implications of interrupts on the interconnection structure. The reader need not 
be concerned at this stage about the details of the generation and processing of in- 
terrupts, but only focus on the communication between modules that results from 
interrupts. 

Interrupts are provided primarily as a way to improve processing efficiency. 
For example, most external devices are much slower than the processor. Suppose 
that the processor is transferring data to a printer using the instruction cycle scheme 
of Figure 3.3. After each write operation, the processor must pause and remain idle 
until the printer catches up. The length of this pause may be on the order of many 
hundreds or even thousands of instruction cycles that do not involve memory. 
Clearly, this is a very wasteful use of the processor. 

Figure 3.7a illustrates this state of affairs. The user program performs a series 
of WRITE calls interleaved with processing. Code segments 1, 2, and 3 refer to se- 
quences of instructions that do not involve I/O. The WRITE calls are to an I/O pro- 
gram that is a system utility and that will perform the actual I/O operation. The I/O 
program consists of three sections: 


e A sequence of instructions, labeled 4 in the figure, to prepare for the actual I/O 
operation. This may include copying the data to be output into a special buffer 
and preparing the parameters for a device command. 


e The actual I/O command. Without the use of interrupts, once this command 
is issued, the program must wait for the I/O device to perform the requested 
function (or periodically poll the device). The program might wait by simply 
repeatedly performing a test operation to determine if the I/O operation 
is done. 


e A sequence of instructions, labeled 5 in the figure, to complete the opera- 
tion. This may include setting a flag indicating the success or failure of the 
operation. 
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(c) Interrupts; long I/O wait 


(b) Interrupts; short I/O wait 


(a) No interrupts 
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‘igure 3.7 Program Flow of Control without and with Interrupts 
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Because the I/O operation may take a relatively long time to complete, the 
I/O program is hung up waiting for the operation to complete; hence, the user 
program is stopped at the point of the WRITE call for some considerable period 
of time. 


INTERRUPTS AND THE INSTRUCTION CYCLE With interrupts, the processor can 
be engaged in executing other instructions while an I/O operation is in progress. 
Consider the flow of control in Figure 3.7b. As before, the user program reaches a 
point at which it makes a system call in the form of a WRITE call. The I/O program 
that is invoked in this case consists only of the preparation code and the actual I/O 
command. After these few instructions have been executed, control returns to the 
user program. Meanwhile, the external device is busy accepting data from computer 
memory and printing it. This I/O operation is conducted concurrently with the exe- 
cution of instructions in the user program. 

When the external device becomes ready to be serviced—that is, when it is 
ready to accept more data from the processor,—the I/O module for that external 
device sends an interrupt request signal to the processor. The processor responds by 
suspending operation of the current program, branching off to a program to service 
that particular I/O device, known as an interrupt handler, and resuming the original 
execution after the device is serviced. The points at which such interrupts occur are 
indicated by an asterisk in Figure 3.7b. 

From the point of view of the user program, an interrupt is just that: an inter- 
ruption of the normal sequence of execution. When the interrupt processing is com- 
pleted, execution resumes (Figure 3.8). Thus, the user program does not have to 
contain any special code to accommodate interrupts; the processor and the operat- 
ing system are responsible for suspending the user program and then resuming it at 
the same point. 

To accommodate interrupts, an interrupt cycle is added to the instruction cycle, 
as shown in Figure 3.9. In the interrupt cycle, the processor checks to see if any 


User program Interrupt handler 


Interrupt 
occurs here : | 
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M 
Figure 3.8 Transfer of Control via Interrupts 
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Figure 3.9 Instruction Cycle with Interrupts 


interrupts have occurred, indicated by the presence of an interrupt signal. If no 
interrupts are pending, the processor proceeds to the fetch cycle and fetches the 
next instruction of the current program. If an interrupt is pending, the processor 
does the following: 


e It suspends execution of the current program being executed and saves its 
context. This means saving the address of the next instruction to be executed 
(current contents of the program counter) and any other data relevant to the 
processor’s current activity. 


e Itsets the program counter to the starting address of an interrupt handler routine. 


The processor now proceeds to the fetch cycle and fetches the first instruction 
in the interrupt handler program, which will service the interrupt. The interrupt han- 
dler program is generally part of the operating system. Typically, this program deter- 
mines the nature of the interrupt and performs whatever actions are needed. In the 
example we have been using, the handler determines which I/O module generated 
the interrupt and may branch to a program that will write more data out to that I/O 
module. When the interrupt handler routine is completed, the processor can resume 
execution of the user program at the point of interruption. 

It is clear that there is some overhead involved in this process. Extra instructions 
must be executed (in the interrupt handler) to determine the nature of the interrupt 
and to decide on the appropriate action. Nevertheless, because of the relatively large 
amount of time that would be wasted by simply waiting on an I/O operation, the 
processor can be employed much more efficiently with the use of interrupts. 

To appreciate the gain in efficiency, consider Figure 3.10, which is a timing dia- 
gram based on the flow of control in Figures 3.7a and 3.7b. Figures 3.7b and 3.10 as- 
sume that the time required for the I/O operation is relatively short: less than the 
time to complete the execution of instructions between write operations in the user 
program. The more typical case, especially for a slow device such as a printer, is that 
the I/O operation will take much more time than executing a sequence of user in- 
structions. Figure 3.7c indicates this state of affairs. In this case, the user program 
reaches the second WRITE call before the I/O operation spawned by the first call is 
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Figure 3.10 Program Timing: Short I/O Wait 


complete. The result is that the user program is hung up at that point. When the 
preceding I/O operation is completed, this new WRITE call may be processed, and 
a new I/O operation may be started. Figure 3.11 shows the timing for this situation 
with and without the use of interrupts. We can see that there is still a gain in effi- 
ciency because part of the time during which the I/O operation is underway over- 
laps with the execution of user instructions. 

Figure 3.12 shows a revised instruction cycle state diagram that includes inter- 
rupt cycle processing. 


MULTIPLE INTERRUPTS The discussion so far has focused only on the occur- 
rence of a single interrupt. Suppose, however, that multiple interrupts can occur. 
For example, a program may be receiving data from a communications line and 
printing results. The printer will generate an interrupt every time that it com- 
pletes a print operation. The communication line controller will generate an in- 
terrupt every time a unit of data arrives. The unit could either be a single 
character or a block, depending on the nature of the communications discipline. 
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Figure 3.11 Program Timing: Long I/O Wait 


In any case, it is possible for a communications interrupt to occur while a printer 
interrupt is being processed. 

Two approaches can be taken to dealing with multiple interrupts. The first is to 
disable interrupts while an interrupt is being processed. A disabled interrupt simply 
means that the processor can and will ignore that interrupt request signal. If an inter- 
rupt occurs during this time, it generally remains pending and will be checked by the 
processor after the processor has enabled interrupts. Thus, when a user program is exe- 
cuting and an interrupt occurs, interrupts are disabled immediately. After the interrupt 
handler routine completes, interrupts are enabled before resuming the user program, 
and the processor checks to see if additional interrupts have occurred. This approach is 
nice and simple, as interrupts are handled in strict sequential order (Figure 3.13a). 
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Figure 3.12 Instruction Cycle State Diagram, with Interrupts 
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(b) Nested interrupt processing 
Figure 3.13 Transfer of Control with Multiple Interrupts 
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.14 Example Time Sequence of Multiple Interrupts 


The drawback to the preceding approach is that it does not take into account 
relative priority or time-critical needs. For example, when input arrives from the 
communications line, it may need to be absorbed rapidly to make room for more 
input. If the first batch of input has not been processed before the second batch 
arrives, data may be lost. 

A second approach is to define priorities for interrupts and to allow an interrupt 
of higher priority to cause a lower-priority interrupt handler to be itself interrupted 
(Figure 3.13b). As an example of this second approach, consider a system with three 
I/O devices: a printer, a disk, and a communications line, with increasing priorities of 2, 
4, and 5, respectively. Figure 3.14, based on an example in [TANE97], illustrates a pos- 
sible sequence. A user program begins at t = 0. Att = 10, a printer interrupt occurs; 
user information is placed on the system stack and execution continues at the printer 
interrupt service routine (ISR). While this routine is still executing, at tf = 15, a com- 
munications interrupt occurs. Because the communications line has higher priority 
than the printer, the interrupt is honored. The printer ISR is interrupted, its state is 
pushed onto the stack, and execution continues at the communications ISR. While this 
routine is executing, a disk interrupt occurs (¢ = 20). Because this interrupt is of lower 
priority, it is simply held, and the communications ISR runs to completion. 

When the communications ISR is complete (¢ = 25), the previous processor 
state is restored, which is the execution of the printer ISR. However, before even a 
single instruction in that routine can be executed, the processor honors the higher- 
priority disk interrupt and control transfers to the disk ISR. Only when that routine is 
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complete (t = 35) is the printer ISR resumed. When that routine completes (¢ = 40), 
control finally returns to the user program. 


I/O Function 


Thus far, we have discussed the operation of the computer as controlled by the 
processor, and we have looked primarily at the interaction of processor and mem- 
ory. The discussion has only alluded to the role of the I/O component. This role is 
discussed in detail in Chapter 7, but a brief summary is in order here. 

An I/O module (e.g., a disk controller) can exchange data directly with the 
processor. Just as the processor can initiate a read or write with memory, designat- 
ing the address of a specific location, the processor can also read data from or write 
data to an I/O module. In this latter case, the processor identifies a specific device 
that is controlled by a particular I/O module. Thus, an instruction sequence similar 
in form to that of Figure 3.5 could occur, with I/O instructions rather than memory- 
referencing instructions. 

In some cases, it is desirable to allow I/O exchanges to occur directly with 
memory. In such a case, the processor grants to an I/O module the authority to read 
from or write to memory, so that the I/O-memory transfer can occur without tying 
up the processor. During such a transfer, the I/O module issues read or write com- 
mands to memory, relieving the processor of responsibility for the exchange. This 
operation is known as direct memory access (DMA) and is examined Chapter 7. 


3.3. INTERCONNECTION STRUCTURES 


A computer consists of a set of components or modules of three basic types (proces- 
sor, memory, I/O) that communicate with each other. In effect, a computer is a net- 
work of basic modules. Thus, there must be paths for connecting the modules. 

The collection of paths connecting the various modules is called the 
interconnection structure. The design of this structure will depend on the exchanges 
that must be made among modules. 

Figure 3.15 suggests the types of exchanges that are needed by indicating the 
major forms of input and output for each module type:? 


e Memory: Typically, a memory module will consist of N words of equal length. 
Each word is assigned a unique numerical address (0, 1,..., N— 1). A word of 
data can be read from or written into the memory. The nature of the operation 
is indicated by read and write control signals. The location for the operation is 
specified by an address. 


e I/O module: From an internal (to the computer system) point of view, I/O is 
functionally similar to memory. There are two operations, read and write. Fur- 
ther, an I/O module may control more than one external device. We can refer 
to each of the interfaces to an external device as a port and give each a unique 
address (e.g.,0,1,..., M—1). In addition, there are external data paths for the 


The wide arrows represent multiple signal lines carrying multiple bits of information in parallel. Each 
narrow arrows represents a single signal line. 


84 CHAPTER 3 / A TOP-LEVEL VIEW OF COMPUTER FUNCTION 


Read 
Memory 
Write 
N words 
Address COOL) | Data > 
Data N-1 071110) 
Read 
1/O module Internal : 
Write data 
External 
Address M ports data 
Internal 
data Interrupt 2 
External unas 
data 
Instructions Address . 
Control 
Data CPU signals 
Interrupt Data E 
signals 


Figure 3.15 Computer Modules 


input and output of data with an external device. Finally, an I/O module may 
be able to send interrupt signals to the processor. 


e Processor: The processor reads in instructions and data, writes out data after 
processing, and uses control signals to control the overall operation of the sys- 
tem. It also receives interrupt signals. 


The preceding list defines the data to be exchanged. The interconnection 
structure must support the following types of transfers: 


e Memory to processor: The processor reads an instruction or a unit of data 
from memory. 

e Processor to memory: The processor writes a unit of data to memory. 

e I/O to processor: The processor reads data from an I/O device via an I/O module. 

e Processor to I/O: The processor sends data to the I/O device. 


e I/O to or from memory: For these two cases, an I/O module is allowed to ex- 
change data directly with memory, without going through the processor, using 
direct memory access (DMA). 
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Over the years, a number of interconnection structures have been tried. By far 
the most common is the bus and various multiple-bus structures. The remainder of 
this chapter is devoted to an assessment of bus structures. 


3.4 BUS INTERCONNECTION 


A bus is a communication pathway connecting two or more devices. A key charac- 
teristic of a bus is that it is a shared transmission medium. Multiple devices connect 
to the bus, and a signal transmitted by any one device is available for reception by all 
other devices attached to the bus. If two devices transmit during the same time pe- 
riod, their signals will overlap and become garbled. Thus, only one device at a time 
can successfully transmit. 

Typically, a bus consists of multiple communication pathways, or lines. Each 
line is capable of transmitting signals representing binary 1 and binary 0. Over time, 
a sequence of binary digits can be transmitted across a single line. Taken together, 
several lines of a bus can be used to transmit binary digits simultaneously (in paral- 
lel). For example, an 8-bit unit of data can be transmitted over eight bus lines. 

Computer systems contain a number of different buses that provide pathways 
between components at various levels of the computer system hierarchy. A bus that 
connects major computer components (processor, memory, I/O) is called a system 
bus. The most common computer interconnection structures are based on the use of 
one or more system buses. 


Bus Structure 


A system bus consists, typically, of from about 50 to hundreds of separate lines. Each 
line is assigned a particular meaning or function. Although there are many different 
bus designs, on any bus the lines can be classified into three functional groups 
(Figure 3.16): data, address, and control lines. In addition, there may be power distri- 
bution lines that supply power to the attached modules. 

The data lines provide a path for moving data among system modules. These 
lines, collectively, are called the data bus. The data bus may consist of 32, 64, 128, or 
even more separate lines, the number of lines being referred to as the width of the 
data bus. Because each line can carry only 1 bit at a time, the number of lines deter- 
mines how many bits can be transferred at a time. The width of the data bus is a key 
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Figure 3.16 Bus Interconnection Scheme 
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factor in determining overall system performance. For example, if the data bus is 
32 bits wide and each instruction is 64 bits long, then the processor must access the 
memory module twice during each instruction cycle. 

The address lines are used to designate the source or destination of the data 
on the data bus. For example, if the processor wishes to read a word (8, 16, or 
32 bits) of data from memory, it puts the address of the desired word on the address 
lines. Clearly, the width of the address bus determines the maximum possible mem- 
ory capacity of the system. Furthermore, the address lines are generally also used 
to address I/O ports. Typically, the higher-order bits are used to select a particular 
module on the bus, and the lower-order bits select a memory location or I/O port 
within the module. For example, on an 8-bit address bus, address 01111111 and 
below might reference locations in a memory module (module 0) with 128 words 
of memory, and address 10000000 and above refer to devices attached to an I/O 
module (module 1). 

The control lines are used to control the access to and the use of the data and 
address lines. Because the data and address lines are shared by all components, 
there must be a means of controlling their use. Control signals transmit both com- 
mand and timing information among system modules. Timing signals indicate the 
validity of data and address information. Command signals specify operations to be 
performed. Typical control lines include 


e Memory write: Causes data on the bus to be written into the addressed location 
e Memory read: Causes data from the addressed location to be placed on the bus 
e I/O write: Causes data on the bus to be output to the addressed I/O port 

e I/O read: Causes data from the addressed I/O port to be placed on the bus 


Transfer ACK: Indicates that data have been accepted from or placed on 
the bus 


e Bus request: Indicates that a module needs to gain control of the bus 


e Bus grant: Indicates that a requesting module has been granted control of the bus 


Interrupt request: Indicates that an interrupt is pending 


Interrupt ACK: Acknowledges that the pending interrupt has been recognized 
e Clock: Is used to synchronize operations 


© Reset: Initializes all modules 


The operation of the bus is as follows. If one module wishes to send data to an- 
other, it must do two things: (1) obtain the use of the bus, and (2) transfer data via 
the bus. If one module wishes to request data from another module, it must (1) 
obtain the use of the bus, and (2) transfer a request to the other module over the 
appropriate control and address lines. It must then wait for that second module to 
send the data. 

Physically, the system bus is actually a number of parallel electrical con- 
ductors. In the classic bus arrangement, these conductors are metal lines etched 
in a card or board (printed circuit board). The bus extends across all of the sys- 
tem components, each of which taps into some or all of the bus lines. The classic 
physical arrangement is depicted in Figure 3.17. In this example, the bus consists 
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of two vertical columns of conductors. At regular intervals along the columns, 
there are attachment points in the form of slots that extend out horizontally to 
support a printed circuit board. Each of the major system components occupies 
one or more boards and plugs into the bus at these slots. The entire arrangement 
is housed in a chassis. This scheme can still be used for some of the buses associ- 
ated with a computer system. However, modern systems tend to have all of the 
major components on the same board with more elements on the same chip as 
the processor. Thus, an on-chip bus may connect the processor and cache mem- 
ory, whereas an on-board bus may connect the processor to main memory and 
other components. 

This arrangement is most convenient. A small computer system may be ac- 
quired and then expanded later (more memory, more I/O) by adding more boards. 
If a component on a board fails, that board can easily be removed and replaced. 


Multiple-Bus Hierarchies 


If a great number of devices are connected to the bus, performance will suffer. There 
are two main causes: 


1. In general, the more devices attached to the bus, the greater the bus length and 
hence the greater the propagation delay. This delay determines the time it 
takes for devices to coordinate the use of the bus. When control of the bus 
passes from one device to another frequently, these propagation delays can 
noticeably affect performance. 
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2. The bus may become a bottleneck as the aggregate data transfer demand 
approaches the capacity of the bus. This problem can be countered to some 
extent by increasing the data rate that the bus can carry and by using wider 
buses (e.g., increasing the data bus from 32 to 64 bits). However, because the 
data rates generated by attached devices (e.g., graphics and video controllers, 
network interfaces) are growing rapidly, this is a race that a single bus is ulti- 
mately destined to lose. 


Accordingly, most computer systems use multiple buses, generally laid out in 
a hierarchy. A typical traditional structure is shown in Figure 3.18a. There is a local 
bus that connects the processor to a cache memory and that may support one or 
more local devices. The cache memory controller connects the cache not only to 
this local bus, but to a system bus to which are attached all of the main memory 
modules. As will be discussed in Chapter 4, the use of a cache structure insulates 
the processor from a requirement to access main memory frequently. Hence, main 
memory can be moved off of the local bus onto a system bus. In this way, I/O trans- 
fers to and from the main memory across the system bus do not interfere with the 
processor’s activity. 

It is possible to connect I/O controllers directly onto the system bus. A more 
efficient solution is to make use of one or more expansion buses for this purpose. An 
expansion bus interface buffers data transfers between the system bus and the I/O 
controllers on the expansion bus. This arrangement allows the system to support a 
wide variety of I/O devices and at the same time insulate memory-to-processor traf- 
fic from I/O traffic. 

Figure 3.18a shows some typical examples of I/O devices that might be attached 
to the expansion bus. Network connections include local area networks (LANs) such 
as a 10-Mbps Ethernet and connections to wide area networks (WANs) such as a 
packet-switching network. SCSI (small computer system interface) is itself a type of 
bus used to support local disk drives and other peripherals. A serial port could be 
used to support a printer or scanner. 

This traditional bus architecture is reasonably efficient but begins to break 
down as higher and higher performance is seen in the I/O devices. In response to 
these growing demands, a common approach taken by industry is to build a high- 
speed bus that is closely integrated with the rest of the system, requiring only a 
bridge between the processor’s bus and the high-speed bus. This arrangement is 
sometimes known as a mezzanine architecture. 

Figure 3.18b shows a typical realization of this approach. Again, there is a local 
bus that connects the processor to a cache controller, which is in turn connected to a 
system bus that supports main memory. The cache controller is integrated into a 
bridge, or buffering device, that connects to the high-speed bus. This bus supports 
connections to high-speed LANs, such as Fast Ethernet at 100 Mbps, video and 
graphics workstation controllers, as well as interface controllers to local peripheral 
buses, including SCSI and FireWire. The latter is a high-speed bus arrangement 
specifically designed to support high-capacity I/O devices. Lower-speed devices are 
still supported off an expansion bus, with an interface buffering traffic between the 
expansion bus and the high-speed bus. 

The advantage of this arrangement is that the high-speed bus brings high- 
demand devices into closer integration with the processor and at the same time is 
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Figure 3.18 Example Bus Configurations 


independent of the processor. Thus, differences in processor and high-speed bus 
speeds and signal line definitions are tolerated. Changes in processor architecture 
do not affect the high-speed bus, and vice versa. 


Elements of Bus Design 


Although a variety of different bus implementations exist, there are a few basic pa- 
rameters or design elements that serve to classify and differentiate buses. Table 3.2 
lists key elements. 
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Table 3.2 Elements of Bus Design 


Type Bus Width 
Dedicated Address 
Multiplexed Data 


Method of Arbitration Data Transfer Type 
Centralized Read 
Distributed Write 


Timing Read-modify-write 


Synchronous Read-after-write 


Asynchronous Block 


BUS TYPES Bus lines can be separated into two generic types: dedicated and multi- 
plexed. A dedicated bus line is permanently assigned either to one function or to a 
physical subset of computer components. 

An example of functional dedication is the use of separate dedicated address 
and data lines, which is common on many buses. However, it is not essential. For ex- 
ample, address and data information may be transmitted over the same set of lines 
using an Address Valid control line. At the beginning of a data transfer, the address 
is placed on the bus and the Address Valid line is activated. At this point, each mod- 
ule has a specified period of time to copy the address and determine if it is the ad- 
dressed module. The address is then removed from the bus, and the same bus 
connections are used for the subsequent read or write data transfer. This method of 
using the same lines for multiple purposes is known as time multiplexing. 

The advantage of time multiplexing is the use of fewer lines, which saves space 
and, usually, cost. The disadvantage is that more complex circuitry is needed within 
each module. Also, there is a potential reduction in performance because certain 
events that share the same lines cannot take place in parallel. 

Physical dedication refers to the use of multiple buses, each of which connects 
only a subset of modules. A typical example is the use of an I/O bus to interconnect 
all I/O modules; this bus is then connected to the main bus through some type of I/O 
adapter module. The potential advantage of physical dedication is high throughput, 
because there is less bus contention. A disadvantage is the increased size and cost of 
the system. 


METHOD OF ARBITRATION In all but the simplest systems, more than one module 
may need control of the bus. For example, an I/O module may need to read or write 
directly to memory, without sending the data to the processor. Because only one 
unit at a time can successfully transmit over the bus, some method of arbitration is 
needed. The various methods can be roughly classified as being either centralized or 
distributed. In a centralized scheme, a single hardware device, referred to as a bus 
controller or arbiter, is responsible for allocating time on the bus. The device may be 
a separate module or part of the processor. In a distributed scheme, there is no cen- 
tral controller. Rather, each module contains access control logic and the modules 
act together to share the bus. With both methods of arbitration, the purpose is to 
designate one device, either the processor or an I/O module, as master. The master 
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may then initiate a data transfer (e.g., read or write) with some other device, which 
acts as slave for this particular exchange. 


TIMING Timing refers to the way in which events are coordinated on the bus. Buses 
use either synchronous timing or asynchronous timing. 

With synchronous timing, the occurrence of events on the bus is determined 
by a clock. The bus includes a clock line upon which a clock transmits a regular se- 
quence of alternating 1s and 0s of equal duration. A single 1-0 transmission is re- 
ferred to as a clock cycle or bus cycle and defines a time slot. All other devices on 
the bus can read the clock line, and all events start at the beginning of a clock 
cycle. Figure 3.19 shows a typical, but simplified, timing diagram for synchronous 
read and write operations (see Appendix 3A for a description of timing dia- 
grams). Other bus signals may change at the leading edge of the clock signal (with 
a slight reaction delay). Most events occupy a single clock cycle. In this simple ex- 
ample, the processor places a memory address on the address lines during the first 


l 
Clock Ff ~~ FJ \7F 
Status : 
li Status signals 
ines 


Stable address 


i} 
i} 
Address | 
lines ; 
i} 
i} 
i} 
i} 
i} 


1 i} 
i} i} 
Address / \ | 
enable ! 
i} i} i} 
i} i} i} 
pat $< valid data in >— ! Valid data in 
lines I 1 
Read ! ! 
cycle ! a 
Read : ! 
i} 
i} 
| 
t 
i} 
i} 
i} 
i} 
i} 
i} 
i} 
i} 
i} 
i} 


I 
I 
Dat 
tins ——1——K Valid data out 
Write ! : 
! I 
oe cee ee oe ee a 
! I 
Write | 
i] 
l 
l 
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clock cycle and may assert various status lines. Once the address lines have stabi- 
lized, the processor issues an address enable signal. For a read operation, the 
processor issues a read command at the start of the second cycle. A memory mod- 
ule recognizes the address and, after a delay of one cycle, places the data on the 
data lines. The processor reads the data from the data lines and drops the read sig- 
nal. For a write operation, the processor puts the data on the data lines at the start 
of the second cycle, and issues a write command after the data lines have stabi- 
lized. The memory module copies the information from the data lines during the 
third clock cycle. 

With asynchronous timing, the occurrence of one event on a bus follows and 
depends on the occurrence of a previous event. In the simple read example of 
Figure 3.20a, the processor places address and status signals on the bus. After 
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Figure 3.20 Timing of Asynchronous Bus Operations 
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pausing for these signals to stabilize, it issues a read command, indicating the pres- 
ence of valid address and control signals. The appropriate memory decodes the ad- 
dress and responds by placing the data on the data line. Once the data lines have 
stabilized, the memory module asserts the acknowledged line to signal the proces- 
sor that the data are available. Once the master has read the data from the data 
lines, it deasserts the read signal. This causes the memory module to drop the data 
and acknowledge lines. Finally, once the acknowledge line is dropped, the master 
removes the address information. 

Figure 3.20b shows a simple asynchronous write operation. In this case, the 
master places the data on the data line at the same time that is puts signals on the 
status and address lines. The memory module responds to the write command by 
copying the data from the data lines and then asserting the acknowledge line. The 
master then drops the write signal and the memory module drops the acknowl- 
edge signal. 

Synchronous timing is simpler to implement and test. However, it is less flexi- 
ble than asynchronous timing. Because all devices on a synchronous bus are tied to 
a fixed clock rate, the system cannot take advantage of advances in device perfor- 
mance. With asynchronous timing, a mixture of slow and fast devices, using older 
and newer technology, can share a bus. 


BUS WIDTH We have already addressed the concept of bus width. The width of the 
data bus has an impact on system performance: The wider the data bus, the greater 
the number of bits transferred at one time. The width of the address bus has an im- 
pact on system capacity: the wider the address bus, the greater the range of locations 
that can be referenced. 


DATA TRANSFER TYPE Finally, a bus supports various data transfer types, as illus- 
trated in Figure 3.21. All buses support both write (master to slave) and read (slave 
to master) transfers. In the case of a multiplexed address/data bus, the bus is first 
used for specifying the address and then for transferring the data. For a read opera- 
tion, there is typically a wait while the data are being fetched from the slave to be 
put on the bus. For either a read or a write, there may also be a delay if it is necessary 
to go through arbitration to gain control of the bus for the remainder of the opera- 
tion (i.e., seize the bus to request a read or write, then seize the bus again to perform 
a read or write). 

In the case of dedicated address and data buses, the address is put on the ad- 
dress bus and remains there while the data are put on the data bus. For a write oper- 
ation, the master puts the data onto the data bus as soon as the address has 
stabilized and the slave has had the opportunity to recognize its address. For a read 
operation, the slave puts the data onto the data bus as soon as it has recognized its 
address and has fetched the data. 

There are also several combination operations that some buses allow. A 
read—modify—write operation is simply a read followed immediately by a write to 
the same address. The address is only broadcast once at the beginning of the 
operation. The whole operation is typically indivisible to prevent any access to 
the data element by other potential bus masters. The principal purpose of this 
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Figure 3.21 Bus Data Transfer Types 


capability is to protect shared memory resources in a multiprogramming system 


(see Chapter 8). 


Read-after-write is an indivisible operation consisting of a write followed im- 
mediately by a read from the same address. The read operation may be performed 


for checking purposes. 


Some bus systems also support a block data transfer. In this case, one address 
cycle is followed by n data cycles. The first data item is transferred to or from the 
specified address; the remaining data items are transferred to or from subsequent 


addresses. 
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3.5 PCI 


The peripheral component interconnect (PCI) is a popular high-bandwidth, 
processor-independent bus that can function as a mezzanine or peripheral bus. 
Compared with other common bus specifications, PCI delivers better system per- 
formance for high-speed I/O subsystems (e.g., graphic display adapters, network 
interface controllers, disk controllers, and so on). The current standard allows the 
use of up to 64 data lines at 66 MHz, for a raw transfer rate of 528 MByte/s, or 
4.224 Gbps. But it is not just a high speed that makes PCI attractive. PCI is specif- 
ically designed to meet economically the I/O requirements of modern systems; it 
requires very few chips to implement and supports other buses attached to the 
PCI bus. 

Intel began work on PCI in 1990 for its Pentium-based systems. Intel soon re- 
leased all the patents to the public domain and promoted the creation of an industry 
association, the PCI Special Interest Group (SIG), to develop further and maintain 
the compatibility of the PCI specifications. The result is that PCI has been widely 
adopted and is finding increasing use in personal computer, workstation, and server 
systems. Because the specification is in the public domain and is supported by a 
broad cross section of the microprocessor and peripheral industry, PCI products 
built by different vendors are compatible. 

PCI is designed to support a variety of microprocessor-based configurations, 
including both single- and multiple-processor systems. Accordingly, it provides a 
general-purpose set of functions. It makes use of synchronous timing and a central- 
ized arbitration scheme. 

Figure 3.22a shows a typical use of PCI in a single-processor system. A com- 
bined DRAM controller and bridge to the PCI bus provides tight coupling with the 
processor and the ability to deliver data at high speeds. The bridge acts as a data 
buffer so that the speed of the PCI bus may differ from that of the processor’s I/O 
capability. In a multiprocessor system (Figure 3.22b), one or more PCI configura- 
tions may be connected by bridges to the processor’s system bus. The system bus 
supports only the processor/cache units, main memory, and the PCI bridges. Again, 
the use of bridges keeps the PCI independent of the processor speed yet provides 
the ability to receive and deliver data rapidly. 


Bus Structure 


PCI may be configured as a 32- or 64-bit bus. Table 3.3 defines the 49 mandatory sig- 
nal lines for PCI. These are divided into the following functional groups: 


e System pins: Include the clock and reset pins. 


e Address and data pins: Include 32 lines that are time multiplexed for ad- 
dresses and data. The other lines in this group are used to interpret and vali- 
date the signal lines that carry the addresses and data. 


e Interface control pins: Control the timing of transactions and provide coordi- 
nation among initiators and targets. 
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e Arbitration pins: Unlike the other PCI signal lines, these are not shared lines. 
Rather, each PCI master has its own pair of arbitration lines that connect it di- 


rectly to the PCI bus arbiter. 


e Error reporting pins: Used to report parity and other errors. 


Designation 


Mandatory PCI Signal Lines 


Description 


System Pins 


CLK 


Provides timing for all transactions and is sampled by all inputs on the rising edge. 
Clock rates up to 33 MHz are supported. 


RST# 


Forces all PCI-specific registers, sequencers, and signals to an initialized state. 


Address and Data Pins 


ADJ[31::0] 


Multiplexed lines used for address and data 


C/BE[3::0]# 


Multiplexed bus command and byte enable signals. During the data phase, the lines 
indicate which of the four byte lanes carry meaningful data. 


PAR 


Provides even parity across AD and C/BE lines one clock cycle later. The master 
drives PAR for address and write data phases; the target drive PAR for read data 
phases. 


Interface Control Pins 


FRAME# 


Driven by current master to indicate the start and duration of a transaction. It is as- 
serted at the start and deasserted when the initiator is ready to begin the final data 
phase. 


Initiator Ready. Driven by current bus master (initiator of transaction). During a 
read, indicates that the master is prepared to accept data; during a write, indicates 
that valid data are present on AD. 


Target Ready. Driven by the target (selected device). During a read, indicates that 
valid data are present on AD; during a write, indicates that target is ready to accept 
data. 


STOP# 


Indicates that current target wishes the initiator to stop the current transaction. 


IDSEL 


Initialization Device Select. Used as a chip select during configuration read and 
write transactions. 


DEVSEL# 


Device Select. Asserted by target when it has recognized its address. Indicates to cur- 
rent initiator whether any device has been selected. 


Arbitration Pins 


Indicates to the arbiter that this device requires use of the bus. This is a device- 
specific point-to-point line. 


Indicates to the device that the arbiter has granted bus access. This is a device- 
specific point-to-point line. 


Error Reporting Pins 


Parity Error. Indicates a data parity error is detected by a target during a write data 
phase or by an initiator during a read data phase. 


System Error. May be pulsed by any device to report address parity errors and 
critical errors other than parity. 


In addition, the PCI specification defines 51 optional signal lines (Table 3.4), 


divided into the following functional groups: 


e 


Interrupt pins: These are provided for PCI devices that must generate re- 
quests for service. As with the arbitration pins, these are not shared lines. 
Rather, each PCI device has its own interrupt line or lines to an interrupt 
controller. 
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Table 3.4 Optional PCI Signal Lines 


Designation Description 


Interrupt Pins 


Used to request an interrupt. 


Used to request an interrupt; only has meaning on a multifunction device. 


Used to request an interrupt; only has meaning on a multifunction device. 


Used to request an interrupt; only has meaning on a multifunction device. 


Cache Support Pins 


in/out | Snoop Backoff. Indicates a hit to a modified line. 


in/out | Snoop Done. Indicates the status of the snoop for the current access. Asserted when 
snoop has been completed. 


64-Bit Bus Extension Pins 
AD[63::32] Multiplexed lines used for address and data to extend bus to 64 bits. 


C/BE[7::4]# Multiplexed bus command and byte enable signals. During the address phase, the 
lines provide additional bus commands. During the data phase, the lines indicate 
which of the four extended byte lanes carry meaningful data. 


REQ64# Used to request 64-bit transfer. 
ACK64# Indicates target is willing to perform 64-bit transfer. 


PAR64 Provides even parity across extended AD and C/BE lines one clock cycle later. 
JTAG/Boundary Scan Pins 


TCK i Test clock. Used to clock state information and test data into and out of the device 
during boundary scan. 


TDI i Test input. Used to serially shift test data and instructions into the device. 


TDO Test output. Used to serially shift test data and instructions out of the device. 


TMS i Test mode Select. Used to control state of test access port controller. 


Test reset. Used to initialize test access port controller. 


in Input-only signal 

out Output-only signal 

t/s Bidirectional, tri-state, I/O signal 

s/t/s Sustained tri-state signal driven by only one owner at a time 
o/d Open drain: allows multiple devices to share as a wire-OR 
# Signal’s active state occurs at low voltage 


e Cache support pins: These pins are needed to support a memory on PCI that 
can be cached in the processor or another device. These pins support snoopy 
cache protocols (see Chapter 18 for a discussion of such protocols). 


e 64-bit bus extension pins: Include 32 lines that are time multiplexed for ad- 
dresses and data and that are combined with the mandatory address/data lines 
to form a 64-bit address/data bus. Other lines in this group are used to interpret 
and validate the signal lines that carry the addresses and data. Finally, there are 
two lines that enable two PCI devices to agree to the use of the 64-bit capability. 


e JTAG/boundary scan pins: These signal lines support testing procedures de- 
fined in IEEE Standard 1149.1. 
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PCI Commands 


Bus activity occurs in the form of transactions between an initiator, or master, and a 
target. When a bus master acquires control of the bus, it determines the type of 
transaction that will occur next. During the address phase of the transaction, the 
C/BE lines are used to signal the transaction type. The commands are as follows: 


e Interrupt Acknowledge 
e Special Cycle 

e I/O Read 

e I/O Write 

e Memory Read 

e Memory Read Line 

e Memory Read Multiple 
e Memory Write 

e Memory Write and Invalidate 
e Configuration Read 

e Configuration Write 

e Dual address Cycle 


Interrupt Acknowledge is a read command intended for the device that func- 
tions as an interrupt controller on the PCI bus. The address lines are not used during 
the address phase, and the byte enable lines indicate the size of the interrupt identi- 
fier to be returned. 

The Special Cycle command is used by the initiator to broadcast a message to 
one or more targets. 

The I/O Read and Write commands are used to transfer data between the initia- 
tor and an I/O controller. Each I/O device has its own address space, and the address 
lines are used to indicate a particular device and to specify the data to be transferred 
to or from that device. The concept of I/O addresses is explored in Chapter 7. 

The memory read and write commands are used to specify the transfer of a 
burst of data, occupying one or more clock cycles. The interpretation of these com- 
mands depends on whether or not the memory controller on the PCI bus supports 
the PCI protocol for transfers between memory and cache. If so, the transfer of data 
to and from the memory is typically in terms of cache lines, or blocks.* The three 
memory read commands have the uses outlined in Table 3.5. The Memory Write 
command is used to transfer data in one or more data cycles to memory. The Mem- 
ory Write and Invalidate command transfers data in one or more cycles to memory. 
In addition, it guarantees that at least one cache line is written. This command sup- 
ports the cache function of writing back a line to memory. 

The two configuration commands enable a master to read and update configu- 
ration parameters in a device connected to the PCI. Each PCI device may include 


3The fundamental principles of cache memory are described in Chapter 4; bus-based cache protocols are 
described in Chapter 17. 
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Table 3.5 Interpretation of PCI Read Commands 


Read Command Type For Cachable Memory For Noncachable Memory 


Memory Read Bursting one-half or less of a Bursting 2 data transfer cycles 
cache line or less 


Memory Read Line Bursting more than one-half a Bursting 3 to 12 data transfers 
cache line to three cache lines 


Memory Read Multiple Bursting more than three cache Bursting more than 12 data 
lines transfers 


up to 256 internal registers that are used during system initialization to configure 
that device. 

The Dual Address Cycle command is used by an initiator to indicate that it is 
using 64-bit addressing. 


Data Transfers 


Every data transfer on the PCI bus is a single transaction consisting of one address 
phase and one or more data phases. In this discussion, we illustrate a typical read 
operation; a write operation proceeds similarly. 

Figure 3.23 shows the timing of the read transaction. All events are synchro- 
nized to the falling transitions of the clock, which occur in the middle of each clock 
cycle. Bus devices sample the bus lines on the rising edge at the beginning of a bus 
cycle. The following are the significant events, labeled on the diagram: 


a. Once a bus master has gained control of the bus, it may begin the transaction 
by asserting FRAME. This line remains asserted until the initiator is ready to 
complete the last data phase. The initiator also puts the start address on the ad- 
dress bus, and the read command on the C/BE lines. 


b. Atthe start of clock 2, the target device will recognize its address on the AD lines. 


c. The initiator ceases driving the AD bus. A turnaround cycle (indicated by the 
two circular arrows) is required on all signal lines that may be driven by more 
than one device, so that the dropping of the address signal will prepare the bus 
for use by the target device. The initiator changes the information on the C/BE 
lines to designate which AD lines are to be used for transfer for the currently 
addressed data (from 1 to 4 bytes). The initiator also asserts IRDY to indicate 
that it is ready for the first data item. 


d. The selected target asserts DEVSEL to indicate that it has recognized its ad- 
dress and will respond. It places the requested data on the AD lines and as- 
serts TRDY to indicate that valid data are present on the bus. 


e. The initiator reads the data at the beginning of clock 4 and changes the byte 
enable lines as needed in preparation for the next read. 


f. In this example, the target needs some time to prepare the second block of data 
for transmission. Therefore, it deasserts TRDY to signal the initiator that there 
will not be new data during the coming cycle. Accordingly, the initiator does not 
read the data lines at the beginning of the fifth clock cycle and does not change 
byte enable during that cycle. The block of data is read at beginning of clock 6. 
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g. During clock 6, the target places the third data item on the bus. However, in 
this example, the initiator is not yet ready to read the data item (e.g., it has a 
temporary buffer full condition). It therefore deasserts IRDY. This will cause 
the target to maintain the third data item on the bus for an extra clock cycle. 


h. The initiator knows that the third data transfer is the last, and so it deasserts 
FRAME to signal the target that this is the last data transfer. It also asserts 
IRDY to signal that it is ready to complete that transfer. 


i. The initiator deasserts IRDY, returning the bus to the idle state, and the target 
deasserts TRDY and DEVSEL. 


Arbitration 


PCI makes use of a centralized, synchronous arbitration scheme in which each mas- 
ter has a unique request (REQ) and grant (GNT) signal. These signal lines are at- 
tached to a central arbiter (Figure 3.24) and a simple request-grant handshake is 
used to grant access to the bus. 

The PCI specification does not dictate a particular arbitration algorithm. The 
arbiter can use a first-come-first-served approach, a round-robin approach, or some 
sort of priority scheme. A PCI master must arbitrate for each transaction that it 
wishes to perform, where a single transaction consists of an address phase followed 
by one or more contiguous data phases. 

Figure 3.25 is an example in which devices A and B are arbitrating for the bus. 
The following sequence occurs: 


a. At some point prior to the start of clock 1, A has asserted its REQ signal. The 
arbiter samples this signal at the beginning of clock cycle 1. 

b. During clock cycle 1, B requests use of the bus by asserting its REQ signal. 

c. At the same time, the arbiter asserts GNT-A to grant bus access to A. 

d. Bus master A samples GNT-A at the beginning of clock 2 and learns that it has 
been granted bus access. It also finds IRDY and TRDY deasserted, indicating 
that the bus is idle. Accordingly, it asserts FRAME and places the address 
information on the address bus and the command on the C/BE bus (not 
shown). It also continues to assert REQ-A, because it has a second transaction 
to perform after this one. 
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Figure 3.24 PCI Bus Arbiter 
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e. The bus arbiter samples all REQ lines at the beginning of clock 3 and makes 
an arbitration decision to grant the bus to B for the next transaction. It then 
asserts GNT-B and deasserts GNT-A. B will not be able to use the bus until it 
returns to an idle state. 

f. A deasserts FRAME to indicate that the last (and only) data transfer is in 
progress. It puts the data on the data bus and signals the target with IRDY. The 
target reads the data at the beginning of the next clock cycle. 

g. At the beginning of clock 5, B finds IRDY and FRAME deasserted and so is 
able to take control of the bus by asserting FRAME. It also deasserts its REQ 
line, because it only wants to perform one transaction. 


Subsequently, master A is granted access to the bus for its next transaction. 
Notice that arbitration can take place at the same time that the current bus 
master is performing a data transfer. Therefore, no bus cycles are lost in performing 


arbitration. This is referred to as hidden arbitration. 


3.6 RECOMMENDED READING AND WEB SITES 


The clearest book-length description of PCI is [SHAN99]. [ABBO04] also contains a lot of 


solid information on PCI. 


ABBO04 Abbot, D. PCI Bus Demystified. New York: Elsevier, 2004. 


SHAN99 Shanley, T., and Anderson, D. PCI Systems Architecture. Richardson, TX: 


Mindshare Press, 1999. 


Recommended Web sites: 


e PCI Special Interest Group: Information about PCI specifications and products 
e PCI Pointers: Links to PCI vendors and other sources of information 


3.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


address bus 
asynchronous timing 
bus 

bus arbitration 


bus width 

centralized arbitration 
data bus 

disabled interrupt 


distributed arbitration 
instruction cycle 
instruction execute 
instruction fetch 
interrupt 

interrupt handler 
interrupt service routine 


memory address register 
(MAR) 

memory buffer register (MBR) 

peripheral component 
interconnect (PCI) 

synchronous timing 

system bus 
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Review Questions 


3.1 
3.2 
3.3 
3.4 


What general categories of functions are specified by computer instructions? 

List and briefly define the possible states that define an instruction execution. 

List and briefly define two approaches to dealing with multiple interrupts. 

What types of transfers must a computer’s interconnection structure (e.g., bus) 
support? 

What is the benefit of using a multiple-bus architecture compared to a single-bus 
architecture? 

List and briefly define the functional groups of signal lines for PCI. 


Problems 


3.1 


3.2 


The hypothetical machine of Figure 3.4 also has two I/O instructions: 


0011 = Load AC from I/O 
0011 = Store AC to I/O 


In these cases, the 12-bit address identifies a particular I/O device. Show the program 
execution (using the format of Figure 3.5) for the following program: 

1. Load AC from device 5. 

2. Add contents of memory location 940. 

3. Store AC to device 6. 


Assume that the next value retrieved from device 5 is 3 and that location 940 contains 
a value of 2. 


The program execution of Figure 3.5 is described in the text using six steps. Expand 
this description to show the use of the MAR and MBR. 


Consider a hypothetical 32-bit microprocessor having 32-bit instructions composed of 
two fields: the first byte contains the opcode and the remainder the immediate 
operand or an operand address. 
a. What is the maximum directly addressable memory capacity (in bytes)? 
b. Discuss the impact on the system speed if the microprocessor bus has 

1. a 32-bit local address bus and a 16-bit local data bus, or 

2. a 16-bit local address bus and a 16-bit local data bus. 
c. How many bits are needed for the program counter and the instruction register? 


Consider a hypothetical microprocessor generating a 16-bit address (for example, as- 
sume that the program counter and the address registers are 16 bits wide) and having 
a 16-bit data bus. 

a. What is the maximum memory address space that the processor can access di- 
rectly if it is connected to a “16-bit memory”? 

b. What is the maximum memory address space that the processor can access di- 
rectly if it is connected to an “8-bit memory”? 

c. What architectural features will allow this microprocessor to access a separate 
“I/O space”? 

d. If an input and an output instruction can specify an 8-bit I/O port number, how 
many 8-bit I/O ports can the microprocessor support? How many 16-bit I/O 
ports? Explain. 

Consider a 32-bit microprocessor, with a 16-bit external data bus, driven by an 

8-MHz input clock. Assume that this microprocessor has a bus cycle whose minimum 

duration equals four input clock cycles. What is the maximum data transfer rate 
across the bus that this microprocessor can sustain, in bytes/s? To increase its perfor- 
mance, would it be better to make its external data bus 32 bits or to double the exter- 
nal clock frequency supplied to the microprocessor? State any other assumptions 
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Bus 


terminator 


Figure 3.26 


you make, and explain. Hint: Determine the number of bytes that can be transferred 
per bus cycle. 

Consider a computer system that contains an I/O module controlling a simple key- 
board/printer teletype. The following registers are contained in the processor and 
connected directly to the system bus: 


INPR: Input Register, 8 bits 
OUTR: Output Register, 8 bits 
FGI: Input Flag, 1 bit 

FGO: Output Flag, 1 bit 
TIEN: Interrupt Enable, 1 bit 


Keystroke input from the teletype and printer output to the teletype are controlled 

by the I/O module. The teletype is able to encode an alphanumeric symbol to an 8-bit 

word and decode an 8-bit word into an alphanumeric symbol. 

a. Describe how the processor, using the first four registers listed in this problem, 
can achieve I/O with the teletype. 

b. Describe how the function can be performed more efficiently by also employing TEN. 


Consider two microprocessors having 8- and 16-bit-wide external data buses, re- 
spectively. The two processors are identical otherwise and their bus cycles take just 
as long. 
a. Suppose all instructions and operands are two bytes long. By what factor do the 
maximum data transfer rates differ? 
b. Repeat assuming that half of the operands and instructions are one byte long. 
Figure 3.26 indicates a distributed arbitration scheme that can be used with an obso- 
lete bus scheme known as Multibus I. Agents are daisy-chained physically in priority 
order. The left-most agent in the diagram receives a constant bus priority in (BPRN) 
signal indicating that no higher-priority agent desires the bus. If the agent does not re- 
quire the bus, it asserts its bus priority out (BPRO) line. At the beginning of a clock 
cycle, any agent can request control of the bus by lowering its BPRO line. This lowers 
the BPRN line of the next agent in the chain, which is in turn required to lower its 
BPRO line. Thus, the signal is propagated the length of the chain. At the end of this 
chain reaction, there should be only one agent whose BPRN is asserted and whose 
BPRO is not. This agent has priority. If, at the beginning of a bus cycle, the bus is not 
busy (BUSY inactive), the agent that has priority may seize control of the bus by as- 
serting the BUSY line. 

It takes a certain amount of time for the BPR signal to propagate from the 
highest-priority agent to the lowest. Must this time be less than the clock cycle? Explain. 
The VAX SBI bus uses a distributed, synchronous arbitration scheme. Each SBI 
device (i.e., processor, memory, I/O module) has a unique priority and is assigned a 
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terminator 
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unique transfer request (TR) line. The SBI has 16 such lines (TRO, TR1,.. .,TR15), 
with TRO having the highest priority. When a device wants to use the bus, it places 
a reservation for a future time slot by asserting its TR line during the current time 
slot. At the end of the current time slot, each device with a pending reservation 
examines the TR lines; the highest-priority device with a reservation uses the next 
time slot. 

A maximum of 17 devices can be attached to the bus. The device with priority 
16 has no TR line. Why not? 


On the VAX SBI, the lowest-priority device usually has the lowest average wait time. 
For this reason, the processor is usually given the lowest priority on the SBI. Why 
does the priority 16 device usually have the lowest average wait time? Under what 
circumstances would this not be true? 


For a synchronous read operation (Figure 3.19), the memory module must place the 

data on the bus sufficiently ahead of the falling edge of the Read signal to allow for 

signal settling. Assume a microprocessor bus is clocked at 10 MHz and that the Read 

signal begins to fall in the middle of the second half of T3. 

a. Determine the length of the memory read instruction cycle. 

b. When, at the latest, should memory data be placed on the bus? Allow 20 ns for the 
settling of data lines. 


Consider a microprocessor that has a memory read timing as shown in Figure 3.19. 
After some analysis, a designer determines that the memory falls short of providing 
read data on time by about 180 ns. 

a. How many wait states (clock cycles) need to be inserted for proper system opera- 
tion if the bus clocking rate is 8 MHz? 

b. To enforce the wait states, a Ready status line is employed. Once the processor has 
issued a Read command, it must wait until the Ready line is asserted before at- 
tempting to read data. At what time interval must we keep the Ready line low in 
order to force the processor to insert the required number of wait states? 


A microprocessor has a memory write timing as shown in Figure 3.19. Its manufac- 
turer specifies that the width of the Write signal can be determined by T — 50, where 
T is the clock period in ns. 

a. What width should we expect for the Write signal if bus clocking rate is 5 MHz? 

b. The data sheet for the microprocessor specifies that the data remain valid for 
20 ns after the falling edge of the Write signal. What is the total duration of valid 
data presentation to memory? 

c. How many wait states should we insert if memory requires valid data presentation 
for at least 190 ns? 

A microprocessor has an increment memory direct instruction, which adds 1 to the 

value in a memory location. The instruction has five stages: fetch opcode (four bus 

clock cycles), fetch operand address (three cycles), fetch operand (three cycles), add 1 

to operand (three cycles), and store operand (three cycles). 

a. By what amount (in percent) will the duration of the instruction increase if we 
have to insert two bus wait states in each memory read and memory write 
operation? 

b. Repeat assuming that the increment operation takes 13 cycles instead of 3 cycles. 

The Intel 8088 microprocessor has a read bus timing similar to that of Figure 3.19, but 

requires four processor clock cycles. The valid data is on the bus for an amount of 

time that extends into the fourth processor clock cycle. Assume a processor clock rate 
of 8 MHz. 

a. What is the maximum data transfer rate? 

b. Repeat but assume the need to insert one wait state per byte transferred. 

The Intel 8086 is a 16-bit processor similar in many ways to the 8-bit 8088. The 8086 

uses a 16-bit bus that can transfer 2 bytes at a time, provided that the lower-order 

byte has an even address. However, the 8086 allows both even- and odd-aligned 
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word operands. If an odd-aligned word is referenced, two memory cycles, each con- 
sisting of four bus cycles, are required to transfer the word. Consider an instruction 
on the 8086 that involves two 16-bit operands. How long does it take to fetch the 
operands? Give the range of possible answers. Assume a clocking rate of 4 MHz and 
no wait states. 


3.17 Consider a 32-bit microprocessor whose bus cycle is the same duration as that of a 16- 
bit microprocessor. Assume that, on average, 20% of the operands and instructions 
are 32 bits long, 40% are 16 bits long, and 40% are only 8 bits long. Calculate the im- 
provement achieved when fetching instructions and operands with the 32-bit micro- 
processor. 


3.18 The microprocessor of Problem 3.14 initiates the fetch operand stage of the incre- 
ment memory direct instruction at the same time that a keyboard actives an interrupt 
request line. After how long does the processor enter the interrupt processing cycle? 
Assume a bus clocking rate of 10 MHz. 


3.19 Draw and explain a timing diagram for a PCI write operation (similar to Fig- 
ure 3.23). 


APPENDIX 3A TIMING DIAGRAMS 


In this chapter, timing diagrams are used to illustrate sequences of events and de- 
pendencies among events. For the reader unfamiliar with timing diagrams, this ap- 
pendix provides a brief explanation. 

Communication among devices connected to a bus takes place along a set of 
lines capable of carrying signals. Two different signal levels (voltage levels), repre- 
senting binary 0 and binary 1, may be transmitted. A timing diagram shows the 
signal level on a line as a function of time (Figure 3.27a). By convention, the 
binary 1 signal level is depicted as a higher level than that of binary 0. Usually, bi- 
nary 0 is the default value. That is, if no data or other signal is being transmitted, 
then the level on a line is that which represents binary 0. A signal transition from 
0 to 1 is frequently referred to as the signal’s leading edge; a transition from 1 to 0 
is referred to as a trailing edge. Such transitions are not instantaneous, but this 
transition time is usually small compared with the duration of a signal level. For 
clarity, the transition is usually depicted as an angled line that exaggerates the rel- 
ative amount of time that the transition takes. Occasionally, you will see diagrams 
that use vertical lines, which incorrectly suggests that the transition is instanta- 
neous. On a timing diagram, it may happen that a variable or at least irrelevant 
amount of time elapses between events of interest. This is depicted by a gap in the 
time line. 

Signals are sometimes represented in groups (Figure 3.27b). For example, if 
data are transferred a byte at a time, then eight lines are required. Generally, it is not 
important to know the exact value being transferred on such a group, but rather 
whether signals are present or not. 

A signal transition on one line may trigger an attached device to make signal 
changes on other lines. For example, if a memory module detects a read control 
signal (0 or 1 transition), it will place data signals on the data lines. Such cause-and- 
effect relationships produce sequences of events. Arrows are used on timing dia- 
grams to show these dependencies (Figure 3.27c). 
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Figure 3.27 Timing Diagrams 


In Figure 3.27c, the overbar over the signal name indicates that the signal is ac- 
tive low as shown. For example, Command is active, or asserted, at 0 volts. This 
means that Command = 0 is interpreted as logical 1, or true. 

A clock line is often part of a system bus. An electronic clock is connected to 
the clock line and provides a repetitive, regular sequence of transitions (Fig- 
ure 3.27d). Other events may be synchronized to the clock signal. 
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KEY POINTS 


® Computer memory is organized into a hierarchy. At the highest level (clos- 
est to the processor) are the processor registers. Next comes one or more 
levels of cache, When multiple levels are used, they are denoted L1, L2, and 
so on. Next comes main memory, which is usually made out of dynamic 
random-access memory (DRAM). All of these are considered internal to 
the computer system. The hierarchy continues with external memory, with 
the next level typically being a fixed hard disk, and one or more levels 
below that consisting of removable media such as optical disks and tape. 


® As one goes down the memory hierarchy, one finds decreasing cost/bit, in- 
creasing capacity, and slower access time. It would be nice to use only the 
fastest memory, but because that is the most expensive memory, we trade 
off access time for cost by using more of the slower memory. The design 
challenge is to organize the data and programs in memory so that the ac- 
cessed memory words are usually in the faster memory. 


® In general, it is likely that most future accesses to main memory by the 
processor will be to locations recently accessed. So the cache automatically 
retains a copy of some of the recently used words from the DRAM. If the 
cache is designed properly, then most of the time the processor will request 
memory words that are already in the cache. 


Although seemingly simple in concept, computer memory exhibits perhaps the widest 
range of type, technology, organization, performance, and cost of any feature of a com- 
puter system. No one technology is optimal in satisfying the memory requirements for 
a computer system. As a consequence, the typical computer system is equipped with a 
hierarchy of memory subsystems, some internal to the system (directly accessible by 
the processor) and some external (accessible by the processor via an I/O module). 

This chapter and the next focus on internal memory elements, while Chapter 6 is 
devoted to external memory. To begin, the first section examines key characteristics of 
computer memories. The remainder of the chapter examines an essential element of all 
modern computer systems: cache memory. 


4.1 COMPUTER MEMORY SYSTEM OVERVIEW 


Characteristics of Memory Systems 


The complex subject of computer memory is made more manageable if we classify 
memory systems according to their key characteristics. The most important of these 
are listed in Table 4.1. 

The term location in Table 4.1 refers to whether memory is internal and exter- 
nal to the computer. Internal memory is often equated with main memory. But there 
are other forms of internal memory. The processor requires its own local memory, in 
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Table 4.1 Key Characteristics of Computer Memory Systems 


Location Performance 


Internal (e.g. processor registers, main Access time 
memory, cache) 


Cycle time 
External (e.g. optical disks, magnetic 
disks, tapes) 


Transfer rate 
c * Physical Type 
apaci 
Lait Semiconductor 
Number of words Me § 
agnetic 


Number of bytes : 
Optical 


Unit of Transfer 
Word 
Block 
Access Method 


Sequential 


Magneto-optical 
Physical Characteristics 
Volatile/nonvolatile 
Erasable/nonerasable 
; Organization 
Direct 
Memory modules 
Random 


Associative 


the form of registers (e.g., see Figure 2.3). Further, as we shall see, the control unit 
portion of the processor may also require its own internal memory. We will defer dis- 
cussion of these latter two types of internal memory to later chapters. Cache is 
another form of internal memory. External memory consists of peripheral storage 
devices, such as disk and tape, that are accessible to the processor via I/O controllers. 

An obvious characteristic of memory is its capacity. For internal memory, this 
is typically expressed in terms of bytes (1 byte = 8 bits) or words. Common word 
lengths are 8, 16, and 32 bits. External memory capacity is typically expressed in 
terms of bytes. 

A related concept is the unit of transfer. For internal memory, the unit of 
transfer is equal to the number of electrical lines into and out of the memory 
module. This may be equal to the word length, but is often larger, such as 64, 128, or 
256 bits. To clarify this point, consider three related concepts for internal memory: 


e Word: The “natural” unit of organization of memory. The size of the word is 
typically equal to the number of bits used to represent an integer and to the in- 
struction length. Unfortunately, there are many exceptions. For example, the 
CRAY C90 (an older model CRAY supercomputer) has a 64-bit word length 
but uses a 46-bit integer representation. The Intel x86 architecture has a wide 
variety of instruction lengths, expressed as multiples of bytes, and a word size 
of 32 bits. 


e Addressable units: In some systems, the addressable unit is the word. How- 
ever, many systems allow addressing at the byte level. In any case, the rela- 
tionship between the length in bits A of an address and the number N of 
addressable units is 24 = N. 


e Unit of transfer: For main memory, this is the number of bits read out of or 
written into memory at a time. The unit of transfer need not equal a word or an 
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addressable unit. For external memory, data are often transferred in much 
larger units than a word, and these are referred to as blocks. 


Another distinction among memory types is the method of accessing units of 
data. These include the following: 


e Sequential access: Memory is organized into units of data, called records. Ac- 
cess must be made in a specific linear sequence. Stored addressing information 
is used to separate records and assist in the retrieval process. A shared read— 
write mechanism is used, and this must be moved from its current location to 
the desired location, passing and rejecting each intermediate record. Thus, the 
time to access an arbitrary record is highly variable. Tape units, discussed in 
Chapter 6, are sequential access. 


e Direct access: As with sequential access, direct access involves a shared 
read—write mechanism. However, individual blocks or records have a unique 
address based on physical location. Access is accomplished by direct access to 
reach a general vicinity plus sequential searching, counting, or waiting to reach 
the final location. Again, access time is variable. Disk units, discussed in 
Chapter 6, are direct access. 


e Random access: Each addressable location in memory has a unique, physically 
wired-in addressing mechanism. The time to access a given location is inde- 
pendent of the sequence of prior accesses and is constant. Thus, any location 
can be selected at random and directly addressed and accessed. Main memory 
and some cache systems are random access. 


e Associative: This is a random access type of memory that enables one to make 
a comparison of desired bit locations within a word for a specified match, and 
to do this for all words simultaneously. Thus, a word is retrieved based on a 
portion of its contents rather than its address. As with ordinary random-access 
memory, each location has its own addressing mechanism, and retrieval time is 
constant independent of location or prior access patterns. Cache memories 
may employ associative access. 


From a user’s point of view, the two most important characteristics of memory 
are capacity and performance. Three performance parameters are used: 


e Access time (latency): For random-access memory, this is the time it takes to 
perform a read or write operation, that is, the time from the instant that an ad- 
dress is presented to the memory to the instant that data have been stored or 
made available for use. For non-random-access memory, access time is the 
time it takes to position the read—write mechanism at the desired location. 


e Memory cycle time: This concept is primarily applied to random-access mem- 
ory and consists of the access time plus any additional time required before a 
second access can commence. This additional time may be required for tran- 
sients to die out on signal lines or to regenerate data if they are read destruc- 
tively. Note that memory cycle time is concerned with the system bus, not the 
processor. 


e Transfer rate: This is the rate at which data can be transferred into or out of a 
memory unit. For random-access memory, it is equal to 1/(cycle time). 
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For non-random-access memory, the following relationship holds: 


n 


R (4.1) 


where 


Ty = Average time to read or write N bits 
T, = Average access time 
n = Number of bits 


R = Transfer rate, in bits per second (bps) 


A variety of physical types of memory have been employed. The most com- 
mon today are semiconductor memory, magnetic surface memory, used for disk and 
tape, and optical and magneto-optical. 

Several physical characteristics of data storage are important. In a volatile 
memory, information decays naturally or is lost when electrical power is switched off. 
In a nonvolatile memory, information once recorded remains without deterioration 
until deliberately changed; no electrical power is needed to retain information. 
Magnetic-surface memories are nonvolatile. Semiconductor memory may be either 
volatile or nonvolatile. Nonerasable memory cannot be altered, except by destroying 
the storage unit. Semiconductor memory of this type is known as read-only memory 
(ROM). Of necessity, a practical nonerasable memory must also be nonvolatile. 

For random-access memory, the organization is a key design issue. By organi- 
zation is meant the physical arrangement of bits to form words. The obvious 
arrangement is not always used, as is explained in Chapter 5. 


The Memory Hierarchy 


The design constraints on a computer’s memory can be summed up by three ques- 
tions: How much? How fast? How expensive? 

The question of how much is somewhat open ended. If the capacity is there, 
applications will likely be developed to use it. The question of how fast is, in a sense, 
easier to answer. To achieve greatest performance, the memory must be able to keep 
up with the processor. That is, as the processor is executing instructions, we would 
not want it to have to pause waiting for instructions or operands. The final question 
must also be considered. For a practical system, the cost of memory must be reason- 
able in relationship to other components. 

As might be expected, there is a trade-off among the three key characteristics 
of memory: namely, capacity, access time, and cost. A variety of technologies are 
used to implement memory systems, and across this spectrum of technologies, the 
following relationships hold: 


e Faster access time, greater cost per bit 
e Greater capacity, smaller cost per bit 
e Greater capacity, slower access time 
The dilemma facing the designer is clear. The designer would like to use mem- 


ory technologies that provide for large-capacity memory, both because the capacity 
is needed and because the cost per bit is low. However, to meet performance 
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requirements, the designer needs to use expensive, relatively lower-capacity memo- 
ries with short access times. 


The way out of this dilemma is not to rely on a single memory component or 


technology, but to employ a memory hierarchy. A typical hierarchy is illustrated in 
Figure 4.1. As one goes down the hierarchy, the following occur: 


a. 


b 
c. 
d 


Decreasing cost per bit 


. Increasing capacity 


Increasing access time 


. Decreasing frequency of access of the memory by the processor 


Thus, smaller, more expensive, faster memories are supplemented by larger, 


cheaper, slower memories. The key to the success of this organization is item (d): 
decreasing frequency of access. We examine this concept in greater detail when we 
discuss the cache, later in this chapter, and virtual memory in Chapter 8. A brief 
explanation is provided at this point. 


Figure 4.1 The Memory Hierarchy 
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Example 4.1 Suppose that the processor has access to two levels of memory. Level 1 
contains 1000 words and has an access time of 0.01 ys; level 2 contains 100,000 words and 
has an access time of 0.1 ys. Assume that if a word to be accessed is in level 1, then the 
processor accesses it directly. If it is in level 2, then the word is first transferred to level 1 
and then accessed by the processor. For simplicity, we ignore the time required for the 
processor to determine whether the word is in level 1 or level 2. Figure 4.2 shows the gen- 
eral shape of the curve that covers this situation. The figure shows the average access 
time to a two-level memory as a function of the hit ratio H, where H is defined as the 
fraction of all memory accesses that are found in the faster memory (e.g., the cache), 7 is 


the access time to level 1, and T> is the access time to level 2.' As can be seen, for high 
percentages of level 1 access, the average total access time is much closer to that of level 
1 than that of level 2. 


In our example, suppose 95% of the memory accesses are found in the cache. Then 
the average time to access a word can be expressed as 


(0.95)(0.01 ys) + (0.05)(0.01 us + 0.1 us) = 0.0095 + 0.0055 = 0.015 ps 


The average access time is much closer to 0.01 ys than to 0.1 ps, as desired. 


T, + T, 


Average access time 


Fraction of accesses involving only level 1 (hit ratio) 


Figure 4.2 Performance of accesses involving only 
level 1 (hit ratio) 


'Tf the accessed word is found in the faster memory, that is defined as a hit. A miss occurs if the accessed 
word is not found in the faster memory. 
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The use of two levels of memory to reduce average access time works in prin- 
ciple, but only if conditions (a) through (d) apply. By employing a variety of tech- 
nologies, a spectrum of memory systems exists that satisfies conditions (a) through 
(c). Fortunately, condition (d) is also generally valid. 

The basis for the validity of condition (d) is a principle known as locality of 
reference [DENN68]. During the course of execution of a program, memory refer- 
ences by the processor, for both instructions and data, tend to cluster. Programs typ- 
ically contain a number of iterative loops and subroutines. Once a loop or subroutine 
is entered, there are repeated references to a small set of instructions. Similarly, 
operations on tables and arrays involve access to a clustered set of data words. Over 
a long period of time, the clusters in use change, but over a short period of time, the 
processor is primarily working with fixed clusters of memory references. 

Accordingly, it is possible to organize data across the hierarchy such that the 
percentage of accesses to each successively lower level is substantially less than that of 
the level above. Consider the two-level example already presented. Let level 2 mem- 
ory contain all program instructions and data. The current clusters can be temporarily 
placed in level 1. From time to time, one of the clusters in level 1 will have to be 
swapped back to level 2 to make room for a new cluster coming in to level 1. On aver- 
age, however, most references will be to instructions and data contained in level 1. 

This principle can be applied across more than two levels of memory, as sug- 
gested by the hierarchy shown in Figure 4.1. The fastest, smallest, and most expen- 
sive type of memory consists of the registers internal to the processor. Typically, a 
processor will contain a few dozen such registers, although some machines contain 
hundreds of registers. Skipping down two levels, main memory is the principal inter- 
nal memory system of the computer. Each location in main memory has a unique 
address. Main memory is usually extended with a higher-speed, smaller cache. The 
cache is not usually visible to the programmer or, indeed, to the processor. It is a de- 
vice for staging the movement of data between main memory and processor regis- 
ters to improve performance. 

The three forms of memory just described are, typically, volatile and employ 
semiconductor technology. The use of three levels exploits the fact that semiconduc- 
tor memory comes in a variety of types, which differ in speed and cost. Data are 
stored more permanently on external mass storage devices, of which the most com- 
mon are hard disk and removable media, such as removable magnetic disk, tape, and 
optical storage. External, nonvolatile memory is also referred to as secondary mem- 
ory or auxiliary memory. These are used to store program and data files and are usu- 
ally visible to the programmer only in terms of files and records, as opposed to 
individual bytes or words. Disk is also used to provide an extension to main memory 
known as virtual memory, which is discussed in Chapter 8. 

Other forms of memory may be included in the hierarchy. For example, large 
IBM mainframes include a form of internal memory known as expanded storage. 
This uses a semiconductor technology that is slower and less expensive than that of 
main memory. Strictly speaking, this memory does not fit into the hierarchy but is a 
side branch: Data can be moved between main memory and expanded storage but 
not between expanded storage and external memory. Other forms of secondary 
memory include optical and magneto-optical disks. Finally, additional levels can be 
effectively added to the hierarchy in software. A portion of main memory can be 
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used as a buffer to hold data temporarily that is to be read out to disk. Such a tech- 
nique, sometimes referred to as a disk cache,” improves performance in two ways: 


e Disk writes are clustered. Instead of many small transfers of data, we have a 
few large transfers of data. This improves disk performance and minimizes 
processor involvement. 


e Some data destined for write-out may be referenced by a program before the 
next dump to disk. In that case, the data are retrieved rapidly from the soft- 
ware cache rather than slowly from the disk. 


Appendix 4A examines the performance implications of multilevel memory 
structures. 


4.2 CACHE MEMORY PRINCIPLES 


Cache memory is intended to give memory speed approaching that of the fastest 
memories available, and at the same time provide a large memory size at the price 
of less expensive types of semiconductor memories. The concept is illustrated in 
Figure 4.3a. There is a relatively large and slow main memory together with a 
smaller, faster cache memory. The cache contains a copy of portions of main mem- 
ory. When the processor attempts to read a word of memory, a check is made to 


Block Transfer 


Word Transfer 


Main memory 


(a) Single cache 


Main 
memory 


(b) Three-level cache organization 


Figure 4.3 Cache and Main Memory 


*Disk cache is generally a purely software technique and is not examined in this book. See [STAL09] for 
a discussion. 
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determine if the word is in the cache. If so, the word is delivered to the processor. If 
not, a block of main memory, consisting of some fixed number of words, is read into 
the cache and then the word is delivered to the processor. Because of the phenome- 
non of locality of reference, when a block of data is fetched into the cache to satisfy 
a single memory reference, it is likely that there will be future references to that 
same memory location or to other words in the block. 

Figure 4.3b depicts the use of multiple levels of cache. The L2 cache is slower 
and typically larger than the L1 cache, and the L3 cache is slower and typically 
larger than the L2 cache. 

Figure 4.4 depicts the structure of a cache/main-memory system. Main memory 
consists of up to 2” addressable words, with each word having a unique n-bit address. 
For mapping purposes, this memory is considered to consist of a number of fixed- 
length blocks of K words each. That is, there are M = 2”/K blocks in main memory. 
The cache consists of m blocks, called lines.* Each line contains K words, plus a tag of 
a few bits. Each line also includes control bits (not shown), such as a bit to indicate 


Line Memory 
number Tag Block address 
0 0 
1 1 
2 2 Block 
3 (K words) 
Ga 
Block length 
(K Words) . 
(a) Cache : 
Block 
2] 
Word 
length 


(b) Main memory 


Figure 4.4 Cache/Main Memory Structure 


3In referring to the basic unit of the cache, the term /ine is used, rather than the term block, for two rea- 
sons: (1) to avoid confusion with a main memory block, which contains the same number of data words as 
a cache line; and (2) because a cache line includes not only K words of data, just as a main memory block, 
but also include tag and control bits. 
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whether the line has been modified since being loaded into the cache. The length of 
a line, not including tag and control bits, is the line size. The line size may be as small 
as 32 bits, with each “word” being a single byte; in this case the line size is 4 bytes. 
The number of lines is considerably less than the number of main memory blocks 
(m << M). At any time, some subset of the blocks of memory resides in lines in the 
cache. If a word in a block of memory is read, that block is transferred to one of the 
lines of the cache. Because there are more blocks than lines, an individual line can- 
not be uniquely and permanently dedicated to a particular block. Thus, each line in- 
cludes a tag that identifies which particular block is currently being stored. The tag 
is usually a portion of the main memory address, as described later in this section. 

Figure 4.5 illustrates the read operation. The processor generates the read ad- 
dress (RA) of a word to be read. If the word is contained in the cache, it is delivered 


START 


Receive address 
RA from CPU 


| 


Access main 
memory for block 
containing RA 


= 


Is block 
containing RA 
in cache? 


Fetch RA word Allocate cache 
and deliver line for main 
to CPU memory block 


Load main 
memory block 


Deliver RA word 


into cache line to CPU 


Figure 4.5 Cache Read Operation 
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Control Control 
Processor b>, Cache 


System bus 


Data <q 


Figure 4.6 Typical Cache Organization 


to the processor. Otherwise, the block containing that word is loaded into the cache, 
and the word is delivered to the processor. Figure 4.5 shows these last two opera- 
tions occurring in parallel and reflects the organization shown in Figure 4.6, which is 
typical of contemporary cache organizations. In this organization, the cache con- 
nects to the processor via data, control, and address lines. The data and address lines 
also attach to data and address buffers, which attach to a system bus from which 
main memory is reached. When a cache hit occurs, the data and address buffers are 
disabled and communication is only between processor and cache, with no system 
bus traffic. When a cache miss occurs, the desired address is loaded onto the system 
bus and the data are returned through the data buffer to both the cache and the 
processor. In other organizations, the cache is physically interposed between the 
processor and the main memory for all data, address, and control lines. In this latter 
case, for a cache miss, the desired word is first read into the cache and then trans- 
ferred from cache to processor. 

A discussion of the performance parameters related to cache use is contained 
in Appendix 4A. 


4.3 ELEMENTS OF CACHE DESIGN 


This section provides an overview of cache design parameters and reports some typ- 
ical results. We occasionally refer to the use of caches in high-performance comput- 
ing (HPC). HPC deals with supercomputers and supercomputer software, especially 
for scientific applications that involve large amounts of data, vector and matrix 
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Table 4.2 Elements of Cache Design 


Cache Addresses Write Policy 
Logical Write through 
Physical Write back 
Cache Size Write once 
Mapping Function Line Size 
Direct Number of caches 
Associative Single or two level 


Set Associative Unified or split 


Replacement Algorithm 
Least recently used (LRU) 
First in first out (FIFO) 
Least frequently used (LFU) 


Random 


computation, and the use of parallel algorithms. Cache design for HPC is quite dif- 
ferent than for other hardware platforms and applications. Indeed, many researchers 
have found that HPC applications perform poorly on computer architectures that 
employ caches [BAIL93]. Other researchers have since shown that a cache hierar- 
chy can be useful in improving performance if the application software is tuned to 
exploit the cache [WANG99, PRES01].4 

Although there are a large number of cache implementations, there are a few 
basic design elements that serve to classify and differentiate cache architectures. 
Table 4.2 lists key elements. 


Cache Addresses 


Almost all nonembedded processors, and many embedded processors, support vir- 
tual memory, a concept discussed in Chapter 8. In essence, virtual memory is a facil- 
ity that allows programs to address memory from a logical point of view, without 
regard to the amount of main memory physically available. When virtual memory is 
used, the address fields of machine instructions contain virtual addresses. For reads 
to and writes from main memory, a hardware memory management unit (MMU) 
translates each virtual address into a physical address in main memory. 

When virtual addresses are used, the system designer may choose to place 
the cache between the processor and the MMU or between the MMU and main 
memory (Figure 4.7). A logical cache, also known as a virtual cache, stores data 
using virtual addresses. The processor accesses the cache directly, without going 
through the MMU. A physical cache stores data using main memory physical 
addresses. 

One obvious advantage of the logical cache is that cache access speed is faster 
than for a physical cache, because the cache can respond before the MMU performs 


‘For a general discussion of HPC, see [DOWD98]. 
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Logical address Physical address 


MMU 
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Cache y 
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(a) Logical cache 
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Processor peniar 
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Data | 


(b) Physical cache 


Figure 4.7 Logical and Physical Caches 


an address translation. The disadvantage has to do with the fact that most virtual 
memory systems supply each application with the same virtual memory address 
space. That is, each application sees a virtual memory that starts at address 0. Thus, 
the same virtual address in two different applications refers to two different physical 
addresses. The cache memory must therefore be completely flushed with each appli- 
cation context switch, or extra bits must be added to each line of the cache to iden- 
tify which virtual address space this address refers to. 

The subject of logical versus physical cache is a complex one, and beyond the 
scope of this book. For a more in-depth discussion, see [CEKL97] and [JACO08]. 


Cache Size 


The first item in Table 4.2, cache size, has already been discussed. We would like the 
size of the cache to be small enough so that the overall average cost per bit is close 
to that of main memory alone and large enough so that the overall average access 
time is close to that of the cache alone. There are several other motivations for min- 
imizing cache size. The larger the cache, the larger the number of gates involved in ad- 
dressing the cache. The result is that large caches tend to be slightly slower than small 
ones—even when built with the same integrated circuit technology and put in the 


Processor 


Type 
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Year of 
Introduction 


L1 Cache* 


L2 Cache 


L3 Cache 


IBM 360/85 
PDP-11/70 


Mainframe 


Minicomputer 


1968 


16 to 32 kB 
1kB 


VAX 11/780 


Minicomputer 


16 kB 


IBM 3033 


Mainframe 


64 kB 


IBM 3090 
Intel 80486 


Mainframe 
PC 


128 to 256 kB 
8kB 


Pentium 


PC 


8 kB/8 kB 


256 to 512 KB 


PowerPC 601 


PC 


32 kB 


PowerPC 620 


PC 


32 kB/32 kB 


PowerPC G4 


PC/server 


32 kB/32 kB 


256 KB to 1 MB 


IBM S/390 G4 


Mainframe 


32 kB 


256 KB 


IBM S/390 G6 


Mainframe 


256 kB 


8 MB 


Pentium 4 


PC/server 


8 kB/8 kB 


256 KB 


IBM SP 


CRAY MTA? 


High-end server/ 
supercomputer 


Supercomputer 


64 kB/32 kB 


8kB 


8 MB 


Itanium 


PC/server 


16 kB/16 kB 


4 MB 


SGI Origin 2001 


High-end server 


32 kB/32 kB 


Itanium 2 


PC/server 


32 kB 


6 MB 


IBM POWERS 


High-end server 


64kB 


36 MB 


CRAY XD-1 


Supercomputer 


64 kKB/64 kB 


IBM POWER6 


PC/server 


64 kKB/64 kB 


32 MB 


IBM z10 


Mainframe 


64 kB/128 kB 


* Two values separated by a slash refer to instruction and data caches. 
> Both caches are instruction only; no data caches. 


24-48 MB 


same place on chip and circuit board. The available chip and board area also limits 
cache size. Because the performance of the cache is very sensitive to the nature of 
the workload, it is impossible to arrive at a single “optimum” cache size. Table 4.3 
lists the cache sizes of some current and past processors. 


Mapping Function 


Because there are fewer cache lines than main memory blocks, an algorithm is 
needed for mapping main memory blocks into cache lines. Further, a means is 
needed for determining which main memory block currently occupies a cache line. 
The choice of the mapping function dictates how the cache is organized. Three tech- 
niques can be used: direct, associative, and set associative. We examine each of these 
in turn. In each case, we look at the general structure and then a specific example. 
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Example 4.2 For all three cases, the example includes the following elements: 


e The cache can hold 64 KBytes. 
e Data are transferred between main memory and the cache in blocks of 4 bytes each. 


This means that the cache is organized as 16K = 2" lines of 4 bytes each. 


e The main memory consists of 16 Mbytes, with each byte directly addressable by a 
24-bit address (24 = 16M). Thus, for mapping purposes, we can consider main mem- 
ory to consist of 4M blocks of 4 bytes each. 


DIRECT MAPPING The simplest technique, known as direct mapping, maps each 
block of main memory into only one possible cache line. The mapping is ex- 
pressed as 


i = j modulo m 
where 


i = cache line number 


j = main memory block number 


m = number of lines in the cache 


Figure 4.8a shows the mapping for the first m blocks of main memory. Each 
block of main memory maps into one unique line of the cache. The next m blocks 
of main memory map into the cache in the same fashion; that is, block B,, 
of main memory maps into line Lg of cache, block B,,; maps into line L;, and 
so on. 

The mapping function is easily implemented using the main memory address. 
Figure 4.9 illustrates the general mechanism. For purposes of cache access, each 
main memory address can be viewed as consisting of three fields. The least signifi- 
cant w bits identify a unique word or byte within a block of main memory; in most 
contemporary machines, the address is at the byte level. The remaining s bits specify 
one of the 2° blocks of main memory. The cache logic interprets these s bits as a tag 
of s — r bits (most significant portion) and a line field of r bits. This latter field iden- 
tifies one of the m = 2’ lines of the cache. To summarize, 


e Address length = (s + w) bits 
e Number of addressable units = 2°*” words or bytes 


e Block size = line size = 2" words or bytes 
Qstw 


QW 


e Number of blocks in main memory = = 2° 
e Number of lines in cache = m = 2” 


r+w 


e Size of cache = 2’*” words or bytes 


e Size of tag = (s — r) bits 
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b t b 
—_——————_ <— <_—_—— 
Bo id Lo 
° e & 
e e & 
e e = 
Bt =I Lint 
First m blocks of Cache memory 
main memory 
(equal to size of cache) b = length of block in bits 


t = length of tag in bits 
(a) Direct mapping 


One block of 
main memory 


‘m-1 


Cache memory 
(b) Associative mapping 


Figure 4.8 Mapping from Main Memory to Cache: Direct and Associative 


The effect of this mapping is that blocks of main memory are assigned to lines 
of the cache as follows: 


Cache line Main memory blocks assigned 

0 0,m,2m,...,2° —m 

1 1ljm+1,2m+1,...,2°-—m+1 
m-1 m-—1,2m —1,3m—-1,...,2°-—1 


Thus, the use of a portion of the address as a line number provides a unique 
mapping of each block of main memory into the cache. When a block is actually 
read into its assigned line, it is necessary to tag the data to distinguish it from 
other blocks that can fit into that line. The most significant s — r bits serve 
this purpose. 
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1 if no match 
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Figure 4.9 Direct-Mapping Cache Organization 


Example 4.2a Figure 4.10 shows our example system using direct mapping.° In the ex- 
ample, m = 16K = 2" andi = j modulo 2". The mapping becomes 


Cache Line Starting Memory Address of Block 
0 000000, 010000, ..., FFO000 
000004, 010004, ..., FFO004 


244 OOFFFC, O1FFFC, ..., FFFFFC 


Note that no two blocks that map into the same line number have the same tag num- 
ber. Thus, blocks with starting addresses 000000, 010000, ..., FFO000 have tag numbers 00, 
01, ..., FE respectively. 

Referring back to Figure 4.5, a read operation works as follows. The cache system is 
presented with a 24-bit address. The 14-bit line number is used as an index into the cache 
to access a particular line. If the 8-bit tag number matches the tag number currently stored 
in that line, then the 2-bit word number is used to select one of the 4 bytes in that line. Oth- 
erwise, the 22-bit tag-plus-line field is used to fetch a block from main memory. The actual 
address that is used for the fetch is the 22-bit tag-plus-line concatenated with two 0 bits, so 
that 4 bytes are fetched starting on a block boundary. 


5In this and subsequent figures, memory values are represented in hexadecimal notation. See Chapter 19 
for a basic refresher on number systems (decimal, binary, hexadecimal). 
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Main memory address (binary) 


Tag : 
Ta Line + Word 
(hex) pi eee aa Ny elt Data 
00 '0000,000000000000000000001 | 13579246 }---- 4 
00 0000,000000000000000001 00! 
I 
~~ ww 
I 
I 
I 
00 WUMNPOMOTITMAITMIITI000 
00 POOWWOOOITIIATTig1 Tid 0 I 
e | 
° Line 
° i Tag Data number 
16 '0007,07100000000000000000 | 77777777 ba 00 | 13579246 | 0000 
16 00107TTO0000000000000004') 11235813 -}---------- lim) 112358137) 0001 
16 = 90010710)00710011 00111001] FEDCBA98 }----------] 16 | FEDCBA98 | 0CE7 
ania FF | 11223344 | 3FFE 
16 =(9007)0110)1111711771111100I | 12345678 }---- z ee 16 | 12345678 | 3FFF 
E | —><——_—_—_>> 
E 8bits 32 bits 
FF (T117,T711000000000000,0000! : 
FF '[11117710000000000000100! | POSS TNE Cane 
I 
I 
~~ we 
I 
I 
I 
FF IQQATTTTUTTTI Titi Ti000i | 11223344 -}---- a 
FF ‘TI1td11l111111f1111100! | 24682468 
Note: Memory address values are 
32 bits in binary representation; 


other values are in hexadecimal 
16-MByte main memory 


Tag Line Word 
Main memory address = 
8 bits 14 bits 2 bits 
Figure 4.10 Direct Mapping Example 


The direct mapping technique is simple and inexpensive to implement. Its 
main disadvantage is that there is a fixed cache location for any given block. Thus, if 
a program happens to reference words repeatedly from two different blocks that 
map into the same line, then the blocks will be continually swapped in the cache, and 
the hit ratio will be low (a phenomenon known as thrashing). 


Selective Victim Cache Simulator 
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W3 


Wa 
Wath 
VL» Wrz] CB 
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(Hit in cache) 


1 if match 
0 if no match 


AY 


(Miss in cache) 


Figure 4.11 Fully Associative Cache Organization 


One approach to lower the miss penalty is to remember what was discarded in 
case it is needed again. Since the discarded data has already been fetched, it can be 
used again at a small cost. Such recycling is possible using a victim cache. Victim cache 
was originally proposed as an approach to reduce the conflict misses of direct mapped 
caches without affecting its fast access time. Victim cache is a fully associative cache, 
whose size is typically 4 to 16 cache lines, residing between a direct mapped L1 cache 
and the next level of memory. This concept is explored in Appendix D. 


ASSOCIATIVE MAPPING Associative mapping overcomes the disadvantage of di- 
rect mapping by permitting each main memory block to be loaded into any line of 
the cache (Figure 4.8b). In this case, the cache control logic interprets a memory ad- 
dress simply as a Tag and a Word field. The Tag field uniquely identifies a block of 
main memory. To determine whether a block is in the cache, the cache control logic 
must simultaneously examine every line’s tag for a match. Figure 4.11 illustrates the 
logic. Note that no field in the address corresponds to the line number, so that the 
number of lines in the cache is not determined by the address format. To summarize, 


e Address length = (s + w) bits 


e Number of addressable units = 2°*" 


words or bytes 


e Block size = line size = 2" words or bytes 
gstw 


e Number of blocks in main memory = aw 2; 


e Number of lines in cache = undetermined 


e Size of tag = s bits 
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Example 4.2b Figure 4.12 shows our example using associative mapping. A main mem- 
ory address consists of a 22-bit tag and a 2-bit byte number. The 22-bit tag must be stored 
with the 32-bit block of data for each line in the cache. Note that it is the leftmost (most 
significant) 22 bits of the address that form the tag. Thus, the 24-bit hexadecimal address 
16339C has the 22-bit tag 058CE7. This is easily seen in binary notation: 


memory address 0001 0110 0011 0011 1001 1100 (binary) 
1 6 3 3 9 C (hex) 


tag (leftmost 22 bits) 00 0101 1000 1100 1110 0111 (binary) 
0 5) 8 Cc E 7 (hex) 


Main memory address (binary) 
oo 


Tag (hex) is uke Data 
000000 000000000000000000000000 | 13579246 | ----- 
000001 000000000000,0000000001001 
mw “we 
Line 
Tag Data Number 
(77) 3FFFFE/ 11223344 | 0000 
r—-—1L-—-+--|058CE7| FEDCBA98 | 0001 
~~ ~ ! | 
058CE6 (0001/0110(0011/001110011000 | 
O58CE7 10001011000110011100111001 | FEDCBA98 | -- “~ ~~ 
O58CE8 000101100011,0011)10100000! | 
mw we I 
r-—7-—1—7—713FFFFD| 33333333 | 3FFD 
I ——-l---|000000] 13579246 | 3FFE 
| -—t--13PFFFF| 24682468 | 3FFF 
| | | <—__> «<—___>» 
mot 4 22bits 32 bits 
a l 16K line cache 
hy ote 
I \ | 
I \ | 
~~ ~ I 
oe 
ca | 
3FFFFD (1117117111 7111711110100 | 33333333 --- I 
3FFFFE (I I1UITTIII1111 11111000 | 11223344 }---- tos - 
3FFFFF '111111111111j111111111100! | 24682468 }—-—-- 4 Note: Memory address values are 
in binary representation; 
32 bits other values are in hexadecimal 
16-MByte main memory 
Tag Word 
Main memory address = 
—_ 
22 bits 2 bits 


Figure 4.12 Associative Mapping Example 
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With associative mapping, there is flexibility as to which block to replace when 
a new block is read into the cache. Replacement algorithms, discussed later in this 
section, are designed to maximize the hit ratio. The principal disadvantage of asso- 
ciative mapping is the complex circuitry required to examine the tags of all cache 
lines in parallel. 


Cache Time Analysis Simulator 


SET-ASSOCIATIVE MAPPING Set-associative mapping is a compromise that ex- 
hibits the strengths of both the direct and associative approaches while reducing 
their disadvantages. 

In this case, the cache consists of a number sets, each of which consists of a 
number of lines. The relationships are 


m=vxXk 


i = jmodulov 


where 
i = cache set number 
j = main memory block number 
m = number of lines in the cache 
v = number of sets 
k = number of lines in each set 


This is referred to as k-way set-associative mapping. With set-associative 
mapping, block B; can be mapped into any of the lines of set j. Figure 4.13a illus- 
trates this mapping for the first v blocks of main memory. As with associative map- 
ping, each word maps into multiple cache lines. For set-associative mapping, each 
word maps into all the cache lines in a specific set, so that main memory block Bg 
maps into set 0, and so on. Thus, the set-associative cache can be physically imple- 
mented as v associative caches. It is also possible to implement the set-associative 
cache as k direct mapping caches, as shown in Figure 4.13b. Each direct-mapped 
cache is referred to as a way, consisting of v lines. The first v lines of main memory 
are direct mapped into the v lines of each way; the next group of v lines of main 
memory are similarly mapped, and so on. The direct-mapped implementation is 
typically used for small degrees of associativity (small values of k) while the asso- 
ciative-mapped implementation is typically used for higher degrees of associativ- 
ity [JJACOO8]. 

For set-associative mapping, the cache control logic interprets a memory 
address as three fields: Tag, Set, and Word. The d set bits specify one of v = 24 
sets. The s bits of the Tag and Set fields specify one of the 2° blocks of main mem- 
ory. Figure 4.14 illustrates the cache control logic. With fully associative mapping, 
the tag in a memory address is quite large and must be compared to the tag of 
every line in the cache. With k-way set-associative mapping, the tag in a memory 
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Figure 4.13 Mapping from Main Memory to Cache: k-way Set Associative 


address is much smaller and is only compared to the k tags within a single set. 
To summarize, 


e Address length = (s + w) bits 


e Number of addressable units = 2°*" 


words or bytes 
e Block size = line size = 2" words or bytes 


gstw 


e Number of blocks in main memory = x0 2° 


e Number of lines in set = k 


e Number of sets = vp = 24 
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Figure 4.14 K-Way Set Associative Cache Organization 


e Number of lines in cache = m = kv = k X 24 
e Size of cache = k X 2“*” words or bytes 
e Size of tag = (s — d) bits 


Example 4.2c Figure 4.15 shows our example using set-associative mapping with two 
lines in each set, referred to as two-way set-associative. The 13-bit set number identifies a 
unique set of two lines within the cache. It also gives the number of the block in main 
memory, modulo 2'°. This determines the mapping of blocks into lines. Thus, blocks 
000000, 008000, ..., FF8000 of main memory map into cache set 0. Any of those blocks 


can be loaded into either of the two lines in the set. Note that no two blocks that map into 
the same cache set have the same tag number. For a read operation, the 13-bit set number 
is used to determine which set of two lines is to be examined. Both lines in the set are ex- 


amined for a match with the tag number of the address to be accessed. 


In the extreme case of vy = m, k = 1, the set-associative technique reduces to 
direct mapping, and for vy = 1,k = m, it reduces to associative mapping. The use of 
two lines per set (v = m/2,k = 2) is the most common set-associative organization. 


Main memory address (binary) 


Tag —_—_——— ‘ 
(hex) Tag Set + Word - Main memory address = 
ata Tag Set Word 
000 13579246 +----- 4 
000 
I 
I 
~8 Ag 9 bits 13 bits 2 bits 
I 
000 
I 
owe : poset Ss Seales See eee see ceS se Sees seas S 
e | Set | 
° Tag Data number Tag Data 
02C ' | eer - — —' aS, 000} 13579246 | 0000 |02C) 77777777 ;--- 
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ww wT A aA A aA 
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in binary representation; 
other values are in hexadecimal 
Figure 4.15 Two-Way Set Associative Mapping Example 
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Figure 4.16 Varying Associativity over Cache Size 


It significantly improves the hit ratio over direct mapping. Four-way set associative 
(v = m/4, k = 4) makes a modest additional improvement for a relatively small ad- 
ditional cost [MAYB84, HILL89]. Further increases in the number of lines per set 
have little effect. 

Figure 4.16 shows the results of one simulation study of set-associative cache 
performance as a function of cache size [GENU04]. The difference in performance 
between direct and two-way set associative is significant up to at least a cache size of 
64 kB. Note also that the difference between two-way and four-way at 4 kB is much 
less than the difference in going from for 4 kB to 8 kB in cache size. The complexity 
of the cache increases in proportion to the associativity, and in this case would not 
be justifiable against increasing cache size to 8 or even 16 Kbytes. A final point to 
note is that beyond about 32 kB, increase in cache size brings no significant increase 
in performance. 

The results of Figure 4.16 are based on simulating the execution of a GCC 
compiler. Different applications may yield different results. For example, [CANTO1] 
reports on the results for cache performance using many of the CPU2000 SPEC 
benchmarks. The results of [CANTO1] in comparing hit ratio to cache size follow the 
same pattern as Figure 4.16, but the specific values are somewhat different. 


Cache Simulator 
Multitask Cache Simulator 
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Replacement Algorithms 


Once the cache has been filled, when a new block is brought into the cache, one of 
the existing blocks must be replaced. For direct mapping, there is only one possible 
line for any particular block, and no choice is possible. For the associative and set- 
associative techniques, a replacement algorithm is needed. To achieve high speed, 
such an algorithm must be implemented in hardware. A number of algorithms have 
been tried. We mention four of the most common. Probably the most effective is 
least recently used (LRU): Replace that block in the set that has been in the cache 
longest with no reference to it. For two-way set associative, this is easily imple- 
mented. Each line includes a USE bit. When a line is referenced, its USE bit is set to 
1 and the USE bit of the other line in that set is set to 0. When a block is to be read 
into the set, the line whose USE bit is 0 is used. Because we are assuming that more 
recently used memory locations are more likely to be referenced, LRU should give 
the best hit ratio. LRU is also relatively easy to implement for a fully associative 
cache. The cache mechanism maintains a separate list of indexes to all the lines in 
the cache. When a line is referenced, it moves to the front of the list. For replace- 
ment, the line at the back of the list is used. Because of its simplicity of implementa- 
tion, LRU is the most popular replacement algorithm. 

Another possibility is first-in-first-out (FIFO): Replace that block in the set that 
has been in the cache longest. FIFO is easily implemented as a round-robin or circular 
buffer technique. Still another possibility is least frequently used (LFU): Replace that 
block in the set that has experienced the fewest references. LFU could be imple- 
mented by associating a counter with each line. A technique not based on usage (i.e., 
not LRU, LFU, FIFO, or some variant) is to pick a line at random from among the 
candidate lines. Simulation studies have shown that random replacement provides 
only slightly inferior performance to an algorithm based on usage [SMIT82]. 


Write Policy 


When a block that is resident in the cache is to be replaced, there are two cases to 
consider. If the old block in the cache has not been altered, then it may be overwrit- 
ten with a new block without first writing out the old block. If at least one write op- 
eration has been performed on a word in that line of the cache, then main memory 
must be updated by writing the line of cache out to the block of memory before 
bringing in the new block. A variety of write policies, with performance and eco- 
nomic trade-offs, is possible. There are two problems to contend with. First, more 
than one device may have access to main memory. For example, an I/O module may 
be able to read-write directly to memory. If a word has been altered only in the 
cache, then the corresponding memory word is invalid. Further, if the I/O device has 
altered main memory, then the cache word is invalid. A more complex problem oc- 
curs when multiple processors are attached to the same bus and each processor has 
its own local cache. Then, if a word is altered in one cache, it could conceivably in- 
validate a word in other caches. 

The simplest technique is called write through. Using this technique, all write 
operations are made to main memory as well as to the cache, ensuring that main 
memory is always valid. Any other processor—-cache module can monitor traffic to 
main memory to maintain consistency within its own cache. The main disadvantage 
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of this technique is that it generates substantial memory traffic and may create a 
bottleneck. An alternative technique, known as write back, minimizes memory 
writes. With write back, updates are made only in the cache. When an update occurs, 
a dirty bit, or use bit, associated with the line is set. Then, when a block is replaced, it 
is written back to main memory if and only if the dirty bit is set. The problem with 
write back is that portions of main memory are invalid, and hence accesses by I/O 
modules can be allowed only through the cache. This makes for complex circuitry 
and a potential bottleneck. Experience has shown that the percentage of memory 
references that are writes is on the order of 15% [SMIT82]. However, for HPC ap- 
plications, this number may approach 33% (vector-vector multiplication) and can go 
as high as 50% (matrix transposition). 


Example 4.3. Consider a cache with a line size of 32 bytes and a main memory that re- 
quires 30 ns to transfer a 4-byte word. For any line that is written at least once before 
being swapped out of the cache, what is the average number of times that the line must be 
written before being swapped out for a write-back cache to be more efficient that a write- 
through cache? 


For the write-back case, each dirty line is written back once, at swap-out time, taking 
8 X 30 = 240 ns. For the write-through case, each update of the line requires that one 
word be written out to main memory, taking 30 ns. Therefore, if the average line that gets 
written at least once gets written more than 8 times before swap out, then write back is 
more efficient. 


In a bus organization in which more than one device (typically a processor) 
has a cache and main memory is shared, a new problem is introduced. If data in one 
cache are altered, this invalidates not only the corresponding word in main memory, 
but also that same word in other caches (if any other cache happens to have that 
same word). Even if a write-through policy is used, the other caches may contain in- 
valid data. A system that prevents this problem is said to maintain cache coherency. 
Possible approaches to cache coherency include the following: 


e Bus watching with write through: Each cache controller monitors the address 
lines to detect write operations to memory by other bus masters. If another 
master writes to a location in shared memory that also resides in the cache 
memory, the cache controller invalidates that cache entry. This strategy de- 
pends on the use of a write-through policy by all cache controllers. 


e Hardware transparency: Additional hardware is used to ensure that all up- 
dates to main memory via cache are reflected in all caches. Thus, if one proces- 
sor modifies a word in its cache, this update is written to main memory. In 
addition, any matching words in other caches are similarly updated. 


e Noncacheable memory: Only a portion of main memory is shared by more 
than one processor, and this is designated as noncacheable. In such a system, 
all accesses to shared memory are cache misses, because the shared memory is 
never copied into the cache. The noncacheable memory can be identified using 
chip-select logic or high-address bits. 
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Cache coherency is an active field of research. This topic is explored further in 
Part Five. 


Line Size 


Another design element is the line size. When a block of data is retrieved and placed 
in the cache, not only the desired word but also some number of adjacent words are 
retrieved. As the block size increases from very small to larger sizes, the hit ratio will 
at first increase because of the principle of locality, which states that data in the 
vicinity of a referenced word are likely to be referenced in the near future. As the 
block size increases, more useful data are brought into the cache. The hit ratio will 
begin to decrease, however, as the block becomes even bigger and the probability of 
using the newly fetched information becomes less than the probability of reusing 
the information that has to be replaced. Two specific effects come into play: 


e Larger blocks reduce the number of blocks that fit into a cache. Because each 
block fetch overwrites older cache contents, a small number of blocks results 
in data being overwritten shortly after they are fetched. 


e Asa block becomes larger, each additional word is farther from the requested 
word and therefore less likely to be needed in the near future. 


The relationship between block size and hit ratio is complex, depending on the 
locality characteristics of a particular program, and no definitive optimum value has 
been found. A size of from 8 to 64 bytes seems reasonably close to optimum 
[SMIT87, PRZY88, PRZY90, HAND98]. For HPC systems, 64- and 128-byte cache 
line sizes are most frequently used. 


Number of Caches 


When caches were originally introduced, the typical system had a single cache. More 
recently, the use of multiple caches has become the norm. Two aspects of this design 
issue concern the number of levels of caches and the use of unified versus split caches. 


MULTILEVEL CACHES As logic density has increased, it has become possible to 
have a cache on the same chip as the processor: the on-chip cache. Compared with a 
cache reachable via an external bus, the on-chip cache reduces the processor’s ex- 
ternal bus activity and therefore speeds up execution times and increases overall 
system performance. When the requested instruction or data is found in the on-chip 
cache, the bus access is eliminated. Because of the short data paths internal to the 
processor, compared with bus lengths, on-chip cache accesses will complete appre- 
ciably faster than would even zero-wait state bus cycles. Furthermore, during this 
period the bus is free to support other transfers. 

The inclusion of an on-chip cache leaves open the question of whether an off- 
chip, or external, cache is still desirable. Typically, the answer is yes, and most contem- 
porary designs include both on-chip and external caches. The simplest such 
organization is known as a two-level cache, with the internal cache designated as level 
1 (L1) and the external cache designated as level 2 (L2). The reason for including an 
L2 cache is the following: If there is no L2 cache and the processor makes an access 
request for a memory location not in the L1 cache, then the processor must access 
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DRAM or ROM memory across the bus. Due to the typically slow bus speed and slow 
memory access time, this results in poor performance. On the other hand, if an L2 
SRAM (static RAM) cache is used, then frequently the missing information can be 
quickly retrieved. If the SRAM is fast enough to match the bus speed, then the data 
can be accessed using a zero-wait state transaction, the fastest type of bus transfer. 

Two features of contemporary cache design for multilevel caches are noteworthy. 
First, for an off-chip L2 cache, many designs do not use the system bus as the path 
for transfer between the L2 cache and the processor, but use a separate data path, so 
as to reduce the burden on the system bus. Second, with the continued shrinkage of 
processor components, a number of processors now incorporate the L2 cache on the 
processor chip, improving performance. 

The potential savings due to the use of an L2 cache depends on the hit rates in 
both the L1 and L2 caches. Several studies have shown that, in general, the use of 
a second-level cache does improve performance (e.g., see [AZIM92], [NOVI93], 
[HAND98]). However, the use of multilevel caches does complicate all of the design 
issues related to caches, including size, replacement algorithm, and write policy; see 
[HAND98] and [PEIR99] for discussions. 

Figure 4.17 shows the results of one simulation study of two-level cache per- 
formance as a function of cache size [GENU04]. The figure assumes that both 
caches have the same line size and shows the total hit ratio. That is, a hit is counted if 
the desired data appears in either the L1 or the L2 cache. The figure shows the im- 
pact of L2 on total hits with respect to L1 size. L2 has little effect on the total num- 
ber of cache hits until it is at least double the L1 cache size. Note that the steepest 
part of the slope for an L1 cache of 8 Kbytes is for an L2 cache of 16 Kbytes. Again 
for an L1 cache of 16 Kbytes, the steepest part of the curve is for an L2 cache size of 
32 Kbytes. Prior to that point, the L2 cache has little, if any, impact on total cache 
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Figure 4.17 Total Hit Ratio (L1 and L2) for 8-Kbyte and 16-Kbyte L1 
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performance. The need for the L2 cache to be larger than the L1 cache to affect per- 
formance makes sense. If the L2 cache has the same line size and capacity as the L1 
cache, its contents will more or less mirror those of the L1 cache. 

With the increasing availability of on-chip area available for cache, most con- 
temporary microprocessors have moved the L2 cache onto the processor chip and 
added an L3 cache. Originally, the L3 cache was accessible over the external bus. 
Mote recently, most microprocessors have incorporated an on-chip L3 cache. In ei- 
ther case, there appears to be a performance advantage to adding the third level 
(e.g., see [GHAI98]). 


UNIFIED VERSUS SPLIT CACHES When the on-chip cache first made an appear- 
ance, many of the designs consisted of a single cache used to store references to both 
data and instructions. More recently, it has become common to split the cache into 
two: one dedicated to instructions and one dedicated to data. These two caches both 
exist at the same level, typically as two L1 caches. When the processor attempts to 
fetch an instruction from main memory, it first consults the instruction L1 cache, and 
when the processor attempts to fetch data from main memory, it first consults the 
data L1 cache. 
There are two potential advantages of a unified cache: 


e Fora given cache size, a unified cache has a higher hit rate than split caches be- 
cause it balances the load between instruction and data fetches automatically. 
That is, if an execution pattern involves many more instruction fetches than 
data fetches, then the cache will tend to fill up with instructions, and if an exe- 
cution pattern involves relatively more data fetches, the opposite will occur. 


e Only one cache needs to be designed and implemented. 


Despite these advantages, the trend is toward split caches, particularly for super- 
scalar machines such as the Pentium and PowerPC, which emphasize parallel instruc- 
tion execution and the prefetching of predicted future instructions. The key advantage 
of the split cache design is that it eliminates contention for the cache between the in- 
struction fetch/decode unit and the execution unit. This is important in any design that 
relies on the pipelining of instructions. Typically, the processor will fetch instructions 
ahead of time and fill a buffer, or pipeline, with instructions to be executed. Suppose 
now that we have a unified instruction/data cache. When the execution unit performs 
a memory access to load and store data, the request is submitted to the unified cache. 
If, at the same time, the instruction prefetcher issues a read request to the cache for an 
instruction, that request will be temporarily blocked so that the cache can service the 
execution unit first, enabling it to complete the currently executing instruction. This 
cache contention can degrade performance by interfering with efficient use of the 
instruction pipeline. The split cache structure overcomes this difficulty. 


4.4 PENTIUM 4 CACHE ORGANIZATION 


The evolution of cache organization is seen clearly in the evolution of Intel micro- 
processors (Table 4.4). The 80386 does not include an on-chip cache. The 80486 
includes a single on-chip cache of 8 KBytes, using a line size of 16 bytes and a four-way 


Table 4.4 Intel Cache Evolution 


Problem 


External memory slower than the system 
bus. 
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Solution 


Add external cache using faster 
memory technology. 


Processor on which 
Feature First Appears 


Increased processor speed results in 
external bus becoming a bottleneck for 
cache access. 


Move external cache on-chip, op- 
erating at the same speed as the 
processor. 


Internal cache is rather small, due to 
limited space on chip 


Add external L2 cache using faster 
technology than main memory 


486 


Contention occurs when both the Instruc- 
tion Prefetcher and the Execution Unit 


simultaneously require access to the cache. 


In that case, the Prefetcher is stalled while 
the Execution Unit’s data access takes 
place. 


Create separate data and instruc- 
tion caches. 


Pentium 


Increased processor speed results in 
external bus becoming a bottleneck for L2 
cache access. 


Some applications deal with massive data- 
bases and must have rapid access to large 
amounts of data. The on-chip caches are 
too small. 


Create separate back-side bus that 
runs at higher speed than the main 
(front-side) external bus. The BSB 
is dedicated to the L2 cache. 


Pentium Pro 


Move L2 cache on to the proces- 
sor chip. 


Add external L3 cache. 


Pentium IT 


Pentium III 


Move L3 cache on-chip. 


Pentium 4 


set-associative organization. All of the Pentium processors include two on-chip L1 
caches, one for data and one for instructions. For the Pentium 4, the L1 data cache 
is 16 KBytes, using a line size of 64 bytes and a four-way set-associative organiza- 
tion. The Pentium 4 instruction cache is described subsequently. The Pentium II 
also includes an L2 cache that feeds both of the L1 caches. The L2 cache is eight- 
way set associative with a size of 512 KB and a line size of 128 bytes. An L3 cache 
was added for the Pentium III and became on-chip with high-end versions of the 
Pentium 4. 

Figure 4.18 provides a simplified view of the Pentium 4 organization, high- 
lighting the placement of the three caches. The processor core consists of four major 
components: 


e Fetch/decode unit: Fetches program instructions in order from the L2 cache, 
decodes these into a series of micro-operations, and stores the results in the L1 
instruction cache. 


¢ Out-of-order execution logic: Schedules execution of the micro-operations 
subject to data dependencies and resource availability; thus, micro-operations 
may be scheduled for execution in a different order than they were fetched 
from the instruction stream. As time permits, this unit schedules speculative 
execution of micro-operations that may be required in the future. 


crt 


System bus 
Out-of-order |. L1 instruction Instruction 
execution cache (12K ops) fetch/decode 
logic unit 
64 
bits 
L3 cache 
(1 MB) 
Integer register file FP register file 
Load Store Simple Simple Complex FP/ FP 
address address integer integer integer MMX move 
unit unit ALU ALU ALU unit unit 
L2 cache 
| | | (512 KB) 
L1 data cache (16 KB) eel 


256 


Figure 4.18 Pentium 4 Block Diagram 
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Table 4.5 Pentium 4 Cache Operating Modes 


Control Bits Operating Mode 


NW Cache Fills Write Throughs Invalidates 


Enabled Enabled Enabled 
Disabled Enabled Enabled 
Disabled Disabled Disabled 


Note: CD = 0;NW = 1 is an invalid combination. 


e Execution units: These units executes micro-operations, fetching the required 
data from the L1 data cache and temporarily storing results in registers. 


e Memory subsystem: This unit includes the L2 and L3 caches and the system 
bus, which is used to access main memory when the L1 and L2 caches have a 
cache miss and to access the system I/O resources. 


Unlike the organization used in all previous Pentium models, and in most 
other processors, the Pentium 4 instruction cache sits between the instruction de- 
code logic and the execution core. The reasoning behind this design decision is as 
follows: As discussed more fully in Chapter 14, the Pentium process decodes, or 
translates, Pentium machine instructions into simple RISC-like instructions called 
micro-operations. The use of simple, fixed-length micro-operations enables the use 
of superscalar pipelining and scheduling techniques that enhance performance. 
However, the Pentium machine instructions are cumbersome to decode; they have a 
variable number of bytes and many different options. It turns out that performance 
is enhanced if this decoding is done independently of the scheduling and pipelining 
logic. We return to this topic in Chapter 14. 

The data cache employs a write-back policy: Data are written to main memory 
only when they are removed from the cache and there has been an update. The Pen- 
tium 4 processor can be dynamically configured to support write-through caching. 

The L1 data cache is controlled by two bits in one of the control registers, la- 
beled the CD (cache disable) and NW (not write-through) bits (Table 4.5). There 
are also two Pentium 4 instructions that can be used to control the data cache: 
INVD invalidates (flushes) the internal cache memory and signals the external 
cache (if any) to invalidate. WBINVD writes back and invalidates internal cache 
and then writes back and invalidates external cache. 

Both the L2 and L3 caches are eight-way setassociative with a line size of 
128 bytes. 


4.5 ARM CACHE ORGANIZATION 


The ARM cache organization has evolved with the overall architecture of the ARM 
family, reflecting the relentless pursuit of performance that is the driving force for 
all microprocessor designers. 
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Table 4.6 ARM Cache Features 


Cache Line Write 
Cache Cache Size Size Buffer Size 
Core Type (kB) (words) Associativity | Location (words) 


ARM720T Unified 8 4 4-way Logical 


ARM920T Split 16/16 D/I 64-way Logical 
ARM926EJ-S 4-128/4-128 D/I Logical 


ARM1022E 16/16 D/I Logical 
ARM1026EJ-S 4-128/4-128 D/I Logical 


Intel 
StrongARM 


Intel Xscale 32/32 D/I Logical 
ARM1136-JF-S 4-64/4-64 D/I 8 Physical 


16/16 D/I Logical 


Table 4.6 shows this evolution. The ARM7 models used a unified L1 cache, 
while all subsequent models use a split instruction/data cache. All of the ARM de- 
signs use a set-associative cache, with the degree of associativity and the line size 
varying. ARM cached cores with an MMU use a logical cache for processor families 
ARM7 through ARM10, including the Intel StongARM and Intel Xscale proces- 
sors. The ARM11 family uses a physical cache. The distinction between logical and 
physical cache is discussed earlier in this chapter (Figure 4.7). 

An interesting feature of the ARM architecture is the use of a small first-in- 
first out (FIFO) write buffer to enhance memory write performance. The write 
buffer is interposed between the cache and main memory and consists of a set of ad- 
dresses and a set of data words. The write buffer is small compared to the cache, and 
may hold up to four independent addresses. Typically, the write buffer is enabled for 
all of main memory, although it may be selectively disabled at the page level. Figure 
4.19, taken from [SLOS04], shows the relationship among the write buffer, cache, 
and main memory. 


Word, byte access Block transfer 
Cache 
Fast Slow 


Main 
Write memory 
Fast buffer Slow 


Processor 


Word, byte access 
Slow 


Figure 4.19 ARM Cache and Write Buffer Organization 
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The write buffer operates as follows: When the processor performs a write to a 
bufferable area, the data are placed in the write buffer at processor clock speed and 
the processor continues execution. A write occurs when data in the cache are writ- 
ten back to main memory. Thus, the data to be written are transferred from the 
cache to the write buffer. The write buffer then performs the external write in paral- 
lel. If, however, the write buffer is full (either because there are already the maxi- 
mum number of words of data in the buffer or because there is no slot for the new 
address) then the processor is stalled until there is sufficient space in the buffer. As 
non-write operations proceed, the write buffer continues to write to main memory 
until the buffer is completely empty. 

Data written to the write buffer are not available for reading back into the 
cache until the data have transferred from the write buffer to main memory. This is 
the principal reason that the write buffer is quite small. Even so, unless there is a 
high proportion of writes in an executing program, the write buffer improves 
performance. 


4.6 RECOMMENDED READING 


[JACOO8] is an excellent, up-to-date treatment of cache design. Another thorough treat- 
ment is [HAND98]. A classic paper that is still well worth reading is [SMITS82]; it surveys the 
various elements of cache design and presents the results of an extensive set of analyses. An- 
other interesting classic is [WILK65], which is probably the first paper to introduce the con- 
cept of the cache. [GOOD83] also provides a useful analysis of cache behavior. Another 
worthwhile analysis is [BELL74]. [AGAR89] presents a detailed examination of a variety of 
cache design issues related to multiprogramming and multiprocessing. [HIGB90] provides a 
set of simple formulas that can be used to estimate cache performance as a function of vari- 
ous cache parameters. 


AGAR89_ Agarwal, A. Analysis of Cache Performance for Operating Systems and Multi- 
programming. Boston: Kluwer Academic Publishers, 1989. 

BELL74 Bell, J.; Casasent, D.; and Bell, C. “An Investigation into Alternative Cache Or- 
ganizations.” JEEE Transactions on Computers, April 1974. http://research 
-microsoft.com/users/GBell/gbvita.htm. 

GOOD83 Goodman, J. “Using Cache Memory to Reduce Processor-Memory Band- 
width.” Proceedings, 10th Annual International Symposium on Computer Architec- 
ture, 1983. Reprinted in [HILLOO]. 

HAND98_ Handy, J. The Cache Memory Book. San Diego: Academic Press, 1993. 

HIGB90_Higbie, L. “Quick and Easy Cache Performance Analysis.” Computer Architec- 
ture News, June 1990. 

JACO08 Jacob, B.; Ng, S.; and Wang, D. Memory Systems: Cache, DRAM, Disk. Boston: 
Morgan Kaufmann, 2008. 

SMIT82 Smith, A. “Cache Memories.” ACM Computing Surveys, September 1992. 

WILK65_ Wilkes, M. “Slave Memories and Dynamic Storage Allocation,” JEEE Transac- 
tions on Electronic Computers, April 1965. Reprinted in [HILLOO]. 
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4.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


access time hit ratio sequential access 
associative mapping instruction cache set-associative mapping 
cache hit L1 cache spatial locality 

cache line L2 cache split cache 

cache memory L3 cache tag 

cache miss locality temporal locality 


cache set logical cache unified cache 

data cache memory hierarchy virtual cache 

direct access multilevel cache write back 

direct mapping physical cache write once 

high-performance computing random access write through 
(HPC) replacement algorithm 


Review Questions 


4.1 What are the differences among sequential access, direct access, and random access? 

4.2 What is the general relationship among access time, memory cost, and capacity? 

4.3 How does the principle of locality relate to the use of multiple memory levels? 

4.4 What are the differences among direct mapping, associative mapping, and set- 
associative mapping? 

4.5 For a direct-mapped cache, a main memory address is viewed as consisting of three 
fields. List and define the three fields. 

4.6 For an associative cache, a main memory address is viewed as consisting of two fields. 
List and define the two fields. 

4.7 For a set-associative cache, a main memory address is viewed as consisting of three 
fields. List and define the three fields. 

4.8 What is the distinction between spatial locality and temporal locality? 

4.9 In general, what are the strategies for exploiting spatial locality and temporal locality? 


Problems 


4.1 A set-associative cache consists of 64 lines, or slots, divided into four-line sets. Main 
memory contains 4K blocks of 128 words each. Show the format of main memory 
addresses. 

4.2 A two-way set-associative cache has lines of 16 bytes and a total size of 8 kbytes. The 
64-Mbyte main memory is byte addressable. Show the format of main memory 
addresses. 

4.3 For the hexadecimal main memory addresses 111111, 666666, BBBBBB, show the fol- 
lowing information, in hexadecimal format: 

a. Tag, Line, and Word values for a direct-mapped cache, using the format of Fig- 
ure 4.10 

b. Tag and Word values for an associative cache, using the format of Figure 4.12 

c. Tag, Set, and Word values for a two-way set-associative cache, using the format of 
Figure 4.15 


4.4 


4.6 


4.7 


4.8 


4.9 
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List the following values: 

a. For the direct cache example of Figure 4.10: address length, number of address- 
able units, block size, number of blocks in main memory, number of lines in cache, 
size of tag 

b. For the associative cache example of Figure 4.12: address length, number of ad- 
dressable units, block size, number of blocks in main memory, number of lines in 
cache, size of tag 

c. For the two-way set-associative cache example of Figure 4.15: address 
length, number of addressable units, block size, number of blocks in main 
memory, number of lines in set, number of sets, number of lines in cache, size 
of tag 

Consider a 32-bit microprocessor that has an on-chip 16-KByte four-way set-associa- 

tive cache. Assume that the cache has a line size of four 32-bit words. Draw a block di- 

agram of this cache showing its organization and how the different address fields are 

used to determine a cache hit/miss. Where in the cache is the word from memory lo- 
cation ABCDE8F8 mapped? 

Given the following specifications for an external cache memory: four-way set asso- 

ciative; line size of two 16-bit words; able to accommodate a total of 4K 32-bit words 

from main memory; used with a 16-bit processor that issues 24-bit addresses. Design 
the cache structure with all pertinent information and show how it interprets the 
processor’s addresses. 

The Intel 80486 has an on-chip, unified cache. It contains 8 KBytes and has a four-way 

set-associative organization and a block length of four 32-bit words. The cache is or- 

ganized into 128 sets. There is a single “line valid bit” and three bits, BO, B1, and B2 

(the “LRU” bits), per line. On a cache miss, the 80486 reads a 16-byte line from main 

memory in a bus memory read burst. Draw a simplified diagram of the cache and 

show how the different fields of the address are interpreted. 

Consider a machine with a byte addressable main memory of 2'° bytes and block size 

of 8 bytes. Assume that a direct mapped cache consisting of 32 lines is used with this 

machine. 

a. How is a 16-bit memory address divided into tag, line number, and byte 
number? 

b. Into what line would bytes with each of the following addresses be stored? 


0001 0001 0001 1011 
1100 0011 0011 0100 
1101 0000 0001 1101 
1010 1010 1010 1010 


c. Suppose the byte with address 0001 1010 0001 1010 is stored in the cache. What 
are the addresses of the other bytes stored along with it? 

d. How many total bytes of memory can be stored in the cache? 

e. Why is the tag also stored in the cache? 


For its on-chip cache, the Intel 80486 uses a replacement algorithm referred to 

as pseudo least recently used. Associated with each of the 128 sets of four lines 

(labeled LO, L1, L2, L3) are three bits BO, B1, and B2. The replacement algorithm 

works as follows: When a line must be replaced, the cache will first determine whether 

the most recent use was from LO and L1 or L2 and L3. Then the cache will determine 

which of the pair of blocks was least recently used and mark it for replacement. 

Figure 4.20 illustrates the logic. 

a. Specify how the bits BO, B1, and B2 are set and then describe in words how they 
are used in the replacement ‘algorithm depicted in Figure 4.20. 

b. Show that the 80486 algorithm approximates a true LRU algorithm. Hint: Con- 
sider the case in which the most recent order of usage is LO, L2, L3, L1. 

c. Demonstrate that a true LRU algorithm would require 6 bits per set. 
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All four lines in No Replace 
the set valid? nonvalid line 
Yes | 
BO = 0? 


Yes, LO or L1 
least recently used 


No, L2 or L3 
least recently used 


B1=0? B2=0? 


Replace Replace Replace Replace 
LO Ll L2 L3 


Figure 4.20 Intel 80486 On-Chip Cache Replacement Strategy 


4.10 A set-associative cache has a block size of four 16-bit words and a set size of 2. The 
cache can accommodate a total of 4096 words. The main memory size that is 
cacheable is 64K X 32 bits. Design the cache structure and show how the processor’s 
addresses are interpreted. 


4.11 Consider a memory system that uses a 32-bit address to address at the byte level, plus 

a cache that uses a 64-byte line size. 

a. Assume a direct mapped cache with a tag field in the address of 20 bits. Show 
the address format and determine the following parameters: number of ad- 
dressable units, number of blocks in main memory, number of lines in cache, 
size of tag. 

b. Assume an associative cache. Show the address format and determine the follow- 
ing parameters: number of addressable units, number of blocks in main memory, 
number of lines in cache, size of tag. 

c. Assume a four-way set-associative cache with a tag field in the address of 9 bits. 
Show the address format and determine the following parameters: number of ad- 
dressable units, number of blocks in main memory, number of lines in set, number 
of sets in cache, number of lines in cache, size of tag. 


4.12 Consider a computer with the following characteristics: total of [Mbyte of main mem- 
ory; word size of 1 byte; block size of 16 bytes; and cache size of 64 Kbytes. 
a. For the main memory addresses of FO010, 01234, and CABBE, give the corre- 
sponding tag, cache line address, and word offsets for a direct-mapped cache. 
b. Give any two main memory addresses with different tags that map to the same 
cache slot for a direct-mapped cache. 
c. For the main memory addresses of F0010 and CABBE, give the corresponding 
tag and offset values for a fully-associative cache. 
d. For the main memory addresses of F0010 and CABBE, give the corresponding 
tag, cache set, and offset values for a two-way set-associative cache. 
4.13 Describe a simple technique for implementing an LRU replacement algorithm in a 
four-way set-associative cache. 
4.14 Consider again Example 4.3. How does the answer change if the main memory uses a 
block transfer capability that has a first-word access time of 30 ns and an access time 
of 5 ns for each word thereafter? 


4.15 


4.16 
4.17 


4.18 


4.19. 


4.20 


4.21 


4.22 
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Consider the following code: 
for (i = 0;i < 20;i++) 

for (j = 0;j < 10;j++) 

ali] = afi]* j 

a. Give one example of the spatial locality in the code. 
b. Give one example of the temporal locality in the code. 
Generalize Equations (4.2) and (4.3), in Appendix 4A, to N-level memory hierarchies. 
A computer system contains a main memory of 32K 16-bit words. It also has a 4K- 
word cache divided into four-line sets with 64 words per line. Assume that the cache is 
initially empty. The processor fetches words from locations 0, 1,2, . . ., 4351 in that 
order. It then repeats this fetch sequence nine more times. The cache is 10 times faster 


than main memory. Estimate the improvement resulting from the use of the cache. 
Assume an LRU policy for block replacement. 


Consider a cache of 4 lines of 16 bytes each. Main memory is divided into blocks of 
16 bytes each. That is, block 0 has bytes with addresses 0 through 15, and so on. Now 
consider a program that accesses memory in the following sequence of addresses: 
Once: 63 through 70 

Loop ten times: 15 through 32; 80 through 95 


a. Suppose the cache is organized as direct mapped. Memory blocks 0,4, and so on are 
assigned to line 1; blocks 1,5, and so on to line 2; and so on. Compute the hit ratio. 

b. Suppose the cache is organized as two-way set associative, with two sets of two 
lines each. Even-numbered blocks are assigned to set 0 and odd-numbered blocks 
are assigned to set 1. Compute the hit ratio for the two-way set-associative cache 
using the least recently used replacement scheme. 


Consider a memory system with the following parameters: 


T.=100ns  C.= 10% $/bit 
T,, = 1200ns_ C,, = 10° $/bit 


a. What is the cost of 1 Mbyte of main memory? 

b. What is the cost of 1 Mbyte of main memory using cache memory technology? 

c. Ifthe effective access time is 10% greater than the cache access time, what is the 
hit ratio H? 

a. Consider an L1 cache with an access time of 1 ns and a hit ratio of H = 0.95. Sup- 
pose that we can change the cache design (size of cache, cache organization) such 
that we increase H to 0.97, but increase access time to 1.5 ns. What conditions 
must be met for this change to result in improved performance? 

b. Explain why this result makes intuitive sense. 


Consider a single-level cache with an access time of 2.5 ns, a line size of 64 bytes, and a 

hit ratio of H = 0.95. Main memory uses a block transfer capability that has a first- 

word (4 bytes) access time of 50 ns and an access time of 5 ns for each word thereafter. 

a. What is the access time when there is a cache miss? Assume that the cache waits 
until the line has been fetched from main memory and then re-executes for a hit. 

b. Suppose that increasing the line size to 128 bytes increases the H to 0.97. Does this 
reduce the average memory access time? 


A computer has a cache, main memory, and a disk used for virtual memory. If a refer- 
enced word is in the cache, 20 ns are required to access it. If it is in main memory but 
not in the cache, 60 ns are needed to load it into the cache, and then the reference is 
started again. If the word is not in main memory, 12 ms are required to fetch the word 
from disk, followed by 60 ns to copy it to the cache, and then the reference is started 
again. The cache hit ratio is 0.9 and the main memory hit ratio is 0.6. What is the aver- 
age time in nanoseconds required to access a referenced word on this system? 
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4.23 


4.24 


4.25 


4.26 


4.27 


4.28 


4.29 


Consider a cache with a line size of 64 bytes. Assume that on average 30% of the lines 

in the cache are dirty. A word consists of 8 bytes. 

a. Assume there is a 3% miss rate (0.97 hit ratio). Compute the amount of main 
memory traffic, in terms of bytes per instruction for both write-through and write- 
back policies. Memory is read into cache one line at a time. However, for write 
back, a single word can be written from cache to main memory. 

b. Repeat part a fora 5% rate. 

c. Repeat part a for a 7% rate. 

d. What conclusion can you draw from these results? 


On the Motorola 68020 microprocessor, a cache access takes two clock cycles. Data 

access from main memory over the bus to the processor takes three clock cycles in the 

case of no wait state insertion; the data are delivered to the processor in parallel with 

delivery to the cache. 

a. Calculate the effective length of a memory cycle given a hit ratio of 0.9 and a 
clocking rate of 16.67 MHz. 

b. Repeat the calculations assuming insertion of two wait states of one cycle each 
per memory cycle. What conclusion can you draw from the results? 


Assume a processor having a memory cycle time of 300 ns and an instruction process- 

ing rate of 1 MIPS. On average, each instruction requires one bus memory cycle for 

instruction fetch and one for the operand it involves. 

a. Calculate the utilization of the bus by the processor. 

b. Suppose the processor is equipped with an instruction cache and the associated 
hit ratio is 0.5. Determine the impact on bus utilization. 


The performance of a single-level cache system for a read operation can be charac- 
terized by the following equation: 


T,=T. + (1- WNT, 


where 7, is the average access time, T, is the cache access time, T,,, is the memory ac- 

cess time (memory to processor register), and # is the hit ratio. For simplicity, we as- 

sume that the word in question is loaded into the cache in parallel with the load to 

processor register. This is the same form as Equation (4.2). 

a. Define J, = time to transfer a line between cache and main memory, and W = 
fraction of write references. Revise the preceding equation to account for writes 
as well as reads, using a write-through policy. 

b. Define W, as the probability that a line in the cache has been altered. Provide an 
equation for 7, for the write-back policy. 

For a system with two levels of cache, define 7, = first-level cache access time; T,.. = 

second-level cache access time; 7,,, = memory access time; H, = first-level cache hit 

ratio; Hy = combined first/second level cache hit ratio. Provide an equation for 7, for 

a read operation. 

Assume the following performance characteristics on a cache read miss: one clock 

cycle to send an address to main memory and four clock cycles to access a 32-bit word 

from main memory and transfer it to the processor and cache. 

a. Ifthe cache line size is one word, what is the miss penalty (i.e., additional time re- 
quired for a read in the event of a read miss)? 

b. What is the miss penalty if the cache line size is four words and a multiple, non- 

burst transfer is executed? 

What is the miss penalty if the cache line size is four words and a transfer is exe- 

cuted, with one clock cycle per word transfer? 


For the cache design of the preceding problem, suppose that increasing the line size 
from one word to four words results in a decrease of the read miss rate from 3.2% to 
1.1%. For both the nonburst transfer and the burst transfer case, what is the average 
miss penalty, averaged over all reads, for the two different line sizes? 


2 
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APPENDIX 4A PERFORMANCE CHARACTERISTICS 


OF TWO-LEVEL MEMORIES 


In this chapter, reference is made to a cache that acts as a buffer between main 
memory and processor, creating a two-level internal memory. This two-level archi- 
tecture exploits a property known as locality to provide improved performance over 
a comparable one-level memory. 

The main memory cache mechanism is part of the computer architecture, im- 
plemented in hardware and typically invisible to the operating system. There are two 
other instances of a two-level memory approach that also exploit locality and that 
are, at least partially, implemented in the operating system: virtual memory and the 
disk cache (Table 4.7). Virtual memory is explored in Chapter 8; disk cache is be- 
yond the scope of this book but is examined in [STAL09]. In this appendix, we look 
at some of the performance characteristics of two-level memories that are common 
to all three approaches. 


Locality 


The basis for the performance advantage of a two-level memory is a principle 
known as locality of reference [DENN68]. This principle states that memory refer- 
ences tend to cluster. Over a long period of time, the clusters in use change, but over 
a short period of time, the processor is primarily working with fixed clusters of 
memory references. 

Intuitively, the principle of locality makes sense. Consider the following line of 
reasoning: 


1. Except for branch and call instructions, which constitute only a small fraction 
of all program instructions, program execution is sequential. Hence, in most 
cases, the next instruction to be fetched immediately follows the last instruc- 
tion fetched. 


2. It is rare to have a long uninterrupted sequence of procedure calls followed by 
the corresponding sequence of returns. Rather, a program remains confined to a 


Table 4.7 Characteristics of Two-Level Memories 


Main Memory Virtual Memory 
Cache (paging) Disk Cache 


Typical access time 5:1(mainmemory 10°: 1 (main memory 10° : 1 (main memory 
ratios vs. cache) vs. disk) vs. disk) 


Memory management Implemented by Combination of hardware System software 
system special hardware and system software 

Typical block or page = 4 to 128 bytes 64 to 4096 bytes (virtual 64 to 4096 bytes 

size (cache block) memory page) (disk block or pages) 


Access of processor Direct access Indirect access Indirect access 
to second level 
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Table 4.8 Relative Dynamic Frequency of High-Level Language Operations 


Study [HUCK83] [KNUT71] [PATTS82a] [TANE78] 
Language Pascal FORTRAN Pascal C SAL 
Workload Scientific Student System System System 


Assign 67 
Loop 

Call 

IF 

GOTO 

Other 


rather narrow window of procedure-invocation depth. Thus, over a short period 
of time references to instructions tend to be localized to a few procedures. 


3. Most iterative constructs consist of a relatively small number of instructions re- 
peated many times. For the duration of the iteration, computation is therefore 
confined to a small contiguous portion of a program. 


4. In many programs, much of the computation involves processing data struc- 
tures, such as arrays or sequences of records. In many cases, successive refer- 
ences to these data structures will be to closely located data items. 


This line of reasoning has been confirmed in many studies. With reference to 
point 1, a variety of studies have analyzed the behavior of high-level language pro- 
grams. Table 4.8 includes key results, measuring the appearance of various statement 
types during execution, from the following studies. The earliest study of programming 
language behavior, performed by Knuth [KNUT71], examined a collection of FOR- 
TRAN programs used as student exercises. Tanenbaum [TANE78] published mea- 
surements collected from over 300 procedures used in operating-system programs 
and written in a language that supports structured programming (SAL). Patterson 
and Sequein [PATT82a] analyzed a set of measurements taken from compilers and 
programs for typesetting, computer-aided design (CAD), sorting, and file comparison. 
The programming languages C and Pascal were studied. Huck [HUCK83] analyzed 
four programs intended to represent a mix of general-purpose scientific computing, 
including fast Fourier transform and the integration of systems of differential equa- 
tions. There is good agreement in the results of this mixture of languages and applica- 
tions that branching and call instructions represent only a fraction of statements 
executed during the lifetime of a program. Thus, these studies confirm assertion 1. 

With respect to assertion 2, studies reported in [PATT85a] provide confirma- 
tion. This is illustrated in Figure 4.21, which shows call-return behavior. Each call is 
represented by the line moving down and to the right, and each return by the line 
moving up and to the right. In the figure, a window with depth equal to 5 is defined. 
Only a sequence of calls and returns with a net movement of 6 in either direction 
causes the window to move. As can be seen, the executing program can remain 
within a stationary window for long periods of time. A study by the same analysts of 
C and Pascal programs showed that a window of depth 8 will need to shift only on 
less than 1% of the calls or returns [TAMI83]. 


Return 


Call 
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Figure 4.21 Example Call-Return Behavior of a Program 


A distinction is made in the literature between spatial locality and temporal 
locality. Spatial locality refers to the tendency of execution to involve a number of 
memory locations that are clustered. This reflects the tendency of a processor to ac- 
cess instructions sequentially. Spatial location also reflects the tendency of a pro- 
gram to access data locations sequentially, such as when processing a table of data. 
Temporal locality refers to the tendency for a processor to access memory locations 
that have been used recently. For example, when an iteration loop is executed, the 
processor executes the same set of instructions repeatedly. 

Traditionally, temporal locality is exploited by keeping recently used instruc- 
tion and data values in cache memory and by exploiting a cache hierarchy. Spatial 
locality is generally exploited by using larger cache blocks and by incorporating 
prefetching mechanisms (fetching items of anticipated use) into the cache control 
logic. Recently, there has been considerable research on refining these techniques to 
achieve greater performance, but the basic strategies remain the same. 


Operation of Two-Level Memory 


The locality property can be exploited in the formation of a two-level memory. The 
upper-level memory (M1) is smaller, faster, and more expensive (per bit) than the 
lower-level memory (M2). M1 is used as a temporary store for part of the contents 
of the larger M2. When a memory reference is made, an attempt is made to access 
the item in M1. If this succeeds, then a quick access is made. If not, then a block of 
memory locations is copied from M2 to M1 and the access then takes place via M1. 
Because of locality, once a block is brought into M1, there should be a number of ac- 
cesses to locations in that block, resulting in fast overall service. 

To express the average time to access an item, we must consider not only the 
speeds of the two levels of memory, but also the probability that a given reference 
can be found in M1. We have 


T,=HXT,+(1-H)X(T,+T) 
=T,+(1-H)XT (4.2) 
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where 


T, = average (system) access time 

T, = access time of M1 (e.g., cache, disk cache) 

Ty = access time of M2 (e.g., main memory, disk) 

H = hit ratio (fraction of time reference is found in M1) 


Figure 4.2 shows average access time as a function of hit ratio. As can be seen, 
for a high percentage of hits, the average total access time is much closer to that of 
M1 than M2. 


Performance 


Let us look at some of the parameters relevant to an assessment of a two-level 
memory mechanism. First consider cost. We have 


— CS, + CS, 


4.3 
: S, + S; (2) 


C, = average cost per bit for the combined two-level memory 
= average cost per bit of upper-level memory M1 

Cy = average cost per bit of lower-level memory M2 

size of M1 

S> = size of M2 


We would like C, ~ C,. Given that C, >> C,, this requires S; <_ S$). Figure 
4.22 shows the relationship. 

Next, consider access time. For a two-level memory to provide a significant 
performance improvement, we need to have T, approximately equal to T, (T, ~ 7;). 
Given that T, is much less than T, (T; << T>), a hit ratio of close to 1 is needed. 

So we would like M1 to be small to hold down cost, and large to improve the 
hit ratio and therefore the performance. Is there a size of M1 that satisfies both 
requirements to a reasonable extent? We can answer this question with a series of 
subquestions: 


Ae) 
| 


mn 
II 


e What value of hit ratio is needed so that T, ~ T,? 
e What size of M1 will assure the needed hit ratio? 
e Does this size satisfy the cost requirement? 
To get at this, consider the quantity 7\/T,, which is referred to as the access efficiency. 


It is a measure of how close average access time (7,) is to M1 access time (7;). From 
Equation (4.2), 


T, 1 (4.4) 
T. . 


i 
La Se 
1 


Figure 4.23 plots T,/T, as a function of the hit ratio H, with the quantity 7;/T| as a pa- 
rameter. Typically, on-chip cache access time is about 25 to 50 times faster than main 
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Figure 4.22 Relationship of Average Memory Cost to Relative Memory Size for a Two-Level Memory 
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Figure 4.23 Access Efficiency as a Function of Hit Ratio (r = T/T) 
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Figure 4.24 Hit Ratio as a Function of Relative Memory Size 


memory access time (i.e., 7>/T; is 5 to 10), off-chip cache access time is about 5 to 15 
times faster than main memory access time (i.e., T>/T, is 5 to 15), and main memory 
access time is about 1000 times faster than disk access time (7>/T, = 1000). Thus, a 
hit ratio in the range of near 0.9 would seem to be needed to satisfy the performance 
requirement. 

We can now phrase the question about relative memory size more exactly. Is a 
hit ratio of, say, 0.8 or better reasonable for S$; << _$,? This will depend on a number 
of factors, including the nature of the software being executed and the details of the 
design of the two-level memory. The main determinant is, of course, the degree of lo- 
cality. Figure 4.24 suggests the effect that locality has on the hit ratio. Clearly, if M1 
is the same size as M2, then the hit ratio will be 1.0: All of the items in M2 are always 
stored also in M1. Now suppose that there is no locality; that is, references are com- 
pletely random. In that case the hit ratio should be a strictly linear function of the 
relative memory size. For example, if M1 is half the size of M2, then at any time half 
of the items from M2 are also in M1 and the hit ratio will be 0.5. In practice, how- 
ever, there is some degree of locality in the references. The effects of moderate and 
strong locality are indicated in the figure. Note that Figure 4.24 is not derived from 
any specific data or model; the figure suggests the type of performance that is seen 
with various degrees of locality. 

So if there is strong locality, it is possible to achieve high values of hit ratio 
even with relatively small upper-level memory size. For example, numerous studies 
have shown that rather small cache sizes will yield a hit ratio above 0.75 regardless 
of the size of main memory (e.g., [AGAR89], [PRZY88], [STRE83], and [SMIT82]). 
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A cache in the range of 1K to 128K words is generally adequate, whereas main 
memory is now typically in the gigabyte range. When we consider virtual memory 
and disk cache, we will cite other studies that confirm the same phenomenon, 
namely that a relatively small M1 yields a high value of hit ratio because of locality. 

This brings us to the last question listed earlier: Does the relative size of the 
two memories satisfy the cost requirement? The answer is clearly yes. If we need 
only a relatively small upper-level memory to achieve good performance, then the 
average cost per bit of the two levels of memory will approach that of the cheaper 
lower-level memory. 

Please note that with L2 cache, or even L2 and L3 caches, involved, analysis is 
much more complex. See [PEIR99] and [HAND98] for discussions. 
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KEY POINTS 


® The two basic forms of semiconductor random access memory are dynamic 
RAM (DRAM) and static RAM (SRAM). SRAM is faster, more expen- 
sive, and less dense than DRAM, and is used for cache memory. DRAM is 
used for main memory. 


® Error correction techniques are commonly used in memory systems. These 
involve adding redundant bits that are a function of the data bits to form an 
error-correcting code. If a bit error occurs, the code will detect and, usually, 
correct the error. 

® To compensate for the relatively slow speed of DRAM, a number of ad- 
vanced DRAM organizations have been introduced. The two most com- 
mon are synchronous DRAM and RamBus DRAM. Both of these involve 
using the system clock to provide for the transfer of blocks of data. 


We begin this chapter with a survey of semiconductor main memory subsystems, in- 
cluding ROM, DRAM, and SRAM memories. Then we look at error control tech- 
niques used to enhance memory reliability. Following this, we look at more advanced 
DRAM architectures. 


5.1 SEMICONDUCTOR MAIN MEMORY 


In earlier computers, the most common form of random-access storage for com- 
puter main memory employed an array of doughnut-shaped ferromagnetic loops 
referred to as cores. Hence, main memory was often referred to as core, a term that 
persists to this day. The advent of, and advantages of, microelectronics has long 
since vanquished the magnetic core memory. Today, the use of semiconductor chips 
for main memory is almost universal. Key aspects of this technology are explored 
in this section. 


Organization 


The basic element of a semiconductor memory is the memory cell. Although a vari- 
ety of electronic technologies are used, all semiconductor memory cells share cer- 
tain properties: 
e They exhibit two stable (or semistable) states, which can be used to represent 
binary 1 and 0. 
e They are capable of being written into (at least once), to set the state. 
e They are capable of being read to sense the state. 


Figure 5.1 depicts the operation of a memory cell. Most commonly, the cell 
has three functional terminals capable of carrying an electrical signal. The select 
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Control Control 


Select Data in Sense 


(a) Write (b) Read 
Figure 5.1 Memory Cell Operation 


terminal, as the name suggests, selects a memory cell for a read or write operation. 
The control terminal indicates read or write. For writing, the other terminal pro- 
vides an electrical signal that sets the state of the cell to 1 or 0. For reading, that ter- 
minal is used for output of the cell’s state. The details of the internal organization, 
functioning, and timing of the memory cell depend on the specific integrated circuit 
technology used and are beyond the scope of this book, except for a brief summary. 
For our purposes, we will take it as given that individual cells can be selected for 
reading and writing operations. 


DRAM and SRAM 


All of the memory types that we will explore in this chapter are random access. That is, 
individual words of memory are directly accessed through wired-in addressing logic. 

Table 5.1 lists the major types of semiconductor memory. The most common is 
referred to as random-access memory (RAM). This is, of course, a misuse of the 
term, because all of the types listed in the table are random access. One distinguish- 
ing characteristic of RAM is that it is possible both to read data from the memory 
and to write new data into the memory easily and rapidly. Both the reading and 
writing are accomplished through the use of electrical signals. 


Table 5.1 Semiconductor Memory Types 


Write 
Memory Type Category Erasure Mechanism | Volatility 


Random-access memory Read-write Electrically, 


(RAM) memory byte-level Electrically Volatile 


Read-only memory (ROM) Read-only ’ Masks 
Not possible 
Programmable ROM (PROM) memory 


UV light, 


LE CUA O Bue 2) chip-level Nonvolatile 


Electrically Erasable PROM Read-mostly | Electrically, Electrically 
(EEPROM) memory byte-level 


Electrically, 


Flash memory block-level 


5.1 / SEMICONDUCTOR MAIN MEMORY 161 


The other distinguishing characteristic of RAM is that it is volatile. A RAM 
must be provided with a constant power supply. If the power is interrupted, then the 
data are lost. Thus, RAM can be used only as temporary storage. The two traditional 
forms of RAM used in computers are DRAM and SRAM. 


DYNAMIC RAM RAM technology is divided into two technologies: dynamic and 
static. A dynamic RAM (DRAM) is made with cells that store data as charge on 
capacitors. The presence or absence of charge in a capacitor is interpreted as a bi- 
nary 1 or 0. Because capacitors have a natural tendency to discharge, dynamic 
RAMs require periodic charge refreshing to maintain data storage. The term 
dynamic refers to this tendency of the stored charge to leak away, even with power 
continuously applied. 

Figure 5.2a is a typical DRAM structure for an individual cell that stores 1 bit. 
The address line is activated when the bit value from this cell is to be read or writ- 
ten. The transistor acts as a switch that is closed (allowing current to flow) if a volt- 
age is applied to the address line and open (no current flows) if no voltage is present 
on the address line. 

For the write operation, a voltage signal is applied to the bit line; a high voltage 
represents 1, and a low voltage represents 0. A signal is then applied to the address 
line, allowing a charge to be transferred to the capacitor. 

For the read operation, when the address line is selected, the transistor turns 
on and the charge stored on the capacitor is fed out onto a bit line and to a sense 
amplifier. The sense amplifier compares the capacitor voltage to a reference value 
and determines if the cell contains a logic 1 or a logic 0. The readout from the cell 
discharges the capacitor, which must be restored to complete the operation. 
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(a) Dynamic RAM (DRAM) cell (b) Static RAM (SRAM) cell 


Figure 5.2 Typical Memory Cell Structures 
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Although the DRAM cell is used to store a single bit (0 or 1), it is essentially 
an analog device. The capacitor can store any charge value within a range; a thresh- 
old value determines whether the charge is interpreted as 1 or 0. 


STATIC RAM In contrast, a static RAM (SRAM) is a digital device that uses the 
same logic elements used in the processor. In a SRAM, binary values are stored 
using traditional flip-flop logic-gate configurations (see Chapter 20 for a description 
of flip-flops). A static RAM will hold its data as long as power is supplied to it. 

Figure 5.2b is a typical SRAM structure for an individual cell. Four transistors 
(T,,T2,T3,T4) are cross connected in an arrangement that produces a stable logic state. 
In logic state 1, point C, is high and point C, is low; in this state,T, and Ty, are off and T, 
and T; are on.' In logic state 0, point C, is low and point C; is high; in this state, T, and 
Ty are on and T> and T; are off. Both states are stable as long as the direct current (dc) 
voltage is applied. Unlike the DRAM, no refresh is needed to retain data. 

As in the DRAM, the SRAM address line is used to open or close a switch. 
The address line controls two transistors (T; and T,). When a signal is applied to this 
line, the two transistors are switched on, allowing a read or write operation. For a 
write operation, the desired bit value is applied to line B, while its complement is ap- 
plied to line B. This forces the four transistors (T;,T,T3,T4) into the proper state. 
For a read operation, the bit value is read from line B. 


SRAM VveRsUS DRAM Both static and dynamic RAMs are volatile; that is, power 
must be continuously supplied to the memory to preserve the bit values. A dynamic 
memory cell is simpler and smaller than a static memory cell. Thus, a DRAM is more 
dense (smaller cells = more cells per unit area) and less expensive than a corre- 
sponding SRAM. On the other hand, a DRAM requires the supporting refresh cir- 
cuitry. For larger memories, the fixed cost of the refresh circuitry is more than 
compensated for by the smaller variable cost of DRAM cells. Thus, DRAMs tend to 
be favored for large memory requirements. A final point is that SRAMs are generally 
somewhat faster than DRAMs. Because of these relative characteristics, SRAM is 
used for cache memory (both on and off chip), and DRAM is used for main memory. 


Types of ROM 


As the name suggests, a read-only memory (ROM) contains a permanent pattern of 
data that cannot be changed. A ROM is nonvolatile; that is, no power source is re- 
quired to maintain the bit values in memory. While it is possible to read a ROM, it is 
not possible to write new data into it. An important application of ROMs is micro- 
programming, discussed in Part Four. Other potential applications include 


e Library subroutines for frequently wanted functions 
e System programs 
e Function tables 


For a modest-sized requirement, the advantage of ROM is that the data or program 
is permanently in main memory and need never be loaded from a secondary stor- 
age device. 


'The circles at the head of T; and T, indicate signal negation. 
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A ROM is created like any other integrated circuit chip, with the data actually 
wired into the chip as part of the fabrication process. This presents two problems: 


e The data insertion step includes a relatively large fixed cost, whether one or 
thousands of copies of a particular ROM are fabricated. 


e There is no room for error. If one bit is wrong, the whole batch of ROMs must 
be thrown out. 


When only a small number of ROMs with a particular memory content is 
needed, a less expensive alternative is the programmable ROM (PROM). Like the 
ROM, the PROM is nonvolatile and may be written into only once. For the PROM, 
the writing process is performed electrically and may be performed by a supplier or 
customer at a time later than the original chip fabrication. Special equipment is re- 
quired for the writing or “programming” process. PROMs provide flexibility and 
convenience. The ROM remains attractive for high-volume production runs. 

Another variation on read-only memory is the read-mostly memory, which is 
useful for applications in which read operations are far more frequent than write 
operations but for which nonvolatile storage is required. There are three common 
forms of read-mostly memory: EPROM, EEPROM, and flash memory. 

The optically erasable programmable read-only memory (EPROM) is read 
and written electrically, as with PROM. However, before a write operation, all the 
storage cells must be erased to the same initial state by exposure of the packaged 
chip to ultraviolet radiation. Erasure is performed by shining an intense ultraviolet 
light through a window that is designed into the memory chip. This erasure process 
can be performed repeatedly; each erasure can take as much as 20 minutes to per- 
form. Thus, the EPROM can be altered multiple times and, like the ROM and 
PROM, holds its data virtually indefinitely. For comparable amounts of storage, the 
EPROM is more expensive than PROM, but it has the advantage of the multiple up- 
date capability. 

A more attractive form of read-mostly memory is electrically erasable pro- 
grammable read-only memory (EEPROM). This is a read-mostly memory that can 
be written into at any time without erasing prior contents; only the byte or bytes ad- 
dressed are updated. The write operation takes considerably longer than the read 
operation, on the order of several hundred microseconds per byte. The EEPROM 
combines the advantage of nonvolatility with the flexibility of being updatable in 
place, using ordinary bus control, address, and data lines. EEPROM is more expen- 
sive than EPROM and also is less dense, supporting fewer bits per chip. 

Another form of semiconductor memory is flash memory (so named because 
of the speed with which it can be reprogrammed). First introduced in the mid-1980s, 
flash memory is intermediate between EPROM and EEPROM in both cost and 
functionality. Like EEPROM, flash memory uses an electrical erasing technology. 
An entire flash memory can be erased in one or a few seconds, which is much faster 
than EPROM. In addition, it is possible to erase just blocks of memory rather than an 
entire chip. Flash memory gets its name because the microchip is organized so that a 
section of memory cells are erased in a single action or “flash.” However, flash mem- 
ory does not provide byte-level erasure. Like EPROM, flash memory uses only one 
transistor per bit, and so achieves the high density (compared with EEPROM) of 
EPROM. 
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Chip Logic 


As with other integrated circuit products, semiconductor memory comes in pack- 
aged chips (Figure 2.7). Each chip contains an array of memory cells. 

In the memory hierarchy as a whole, we saw that there are trade-offs among 
speed, capacity, and cost. These trade-offs also exist when we consider the organization 
of memory cells and functional logic on a chip. For semiconductor memories, one of the 
key design issues is the number of bits of data that may be read/written at a time. At one 
extreme is an organization in which the physical arrangement of cells in the array is the 
same as the logical arrangement (as perceived by the processor) of words in memory. 
The array is organized into W words of B bits each. For example, a 16-Mbit chip could 
be organized as 1M 16-bit words. At the other extreme is the so-called 1-bit-per-chip or- 
ganization, in which data are read/written 1 bit at a time. We will illustrate memory chip 
organization with a DRAM; ROM organization is similar, though simpler. 

Figure 5.3 shows a typical organization of a 16-Mbit DRAM. In this case, 4 bits 
are read or written at a time. Logically, the memory array is organized as four square 
arrays of 2048 by 2048 elements. Various physical arrangements are possible. In any 
case, the elements of the array are connected by both horizontal (row) and vertical 
(column) lines. Each horizontal line connects to the Select terminal of each cell in its 
row; each vertical line connects to the Data-In/Sense terminal of each cell in its column. 

Address lines supply the address of the word to be selected. A total of log, W 
lines are needed. In our example, 11 address lines are needed to select one of 2048 
rows. These 11 lines are fed into a row decoder, which has 11 lines of input and 2048 
lines for output. The logic of the decoder activates a single one of the 2048 outputs 
depending on the bit pattern on the 11 input lines (2'! = 2048). 

An additional 11 address lines select one of 2048 columns of 4 bits per column. 
Four data lines are used for the input and output of 4 bits to and from a data buffer. 
On input (write), the bit driver of each bit line is activated for a 1 or 0 according to 
the value of the corresponding data line. On output (read), the value of each bit line 
is passed through a sense amplifier and presented to the data lines. The row line se- 
lects which row of cells is used for reading or writing. 

Because only 4 bits are read/written to this DRAM, there must be multiple 
DRAMs connected to the memory controller to read/write a word of data to the bus. 

Note that there are only 11 address lines (AO-A10), half the number you 
would expect for a 2048 X 2048 array. This is done to save on the number of pins. 
The 22 required address lines are passed through select logic external to the chip 
and multiplexed onto the 11 address lines. First, 11 address signals are passed to the 
chip to define the row address of the array, and then the other 11 address signals are 
presented for the column address. These signals are accompanied by row address se- 
lect (RAS) and column address select (CAS) signals to provide timing to the chip. 

The write enable (WE) and output enable (OE) pins determine whether a 
write or read operation is performed. Two other pins, not shown in Figure 5.3, are 
ground (Vss) and a voltage source (Vcc). 

As an aside, multiplexed addressing plus the use of square arrays result in a 
quadrupling of memory size with each new generation of memory chips. One more 
pin devoted to addressing doubles the number of rows and columns, and so the size 
of the chip memory grows by a factor of 4. 
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Figure 5.3. Typical 16 Megabit DRAM (4M x 4) 
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Figure 5.3 also indicates the inclusion of refresh circuitry. All DRAMs require a 
refresh operation. A simple technique for refreshing is, in effect, to disable the DRAM 
chip while all data cells are refreshed. The refresh counter steps through all of the row 
values. For each row, the output lines from the refresh counter are supplied to the row 
decoder and the RAS line is activated. The data are read out and written back into the 
same location. This causes each cell in the row to be refreshed. 


Chip Packaging 


As was mentioned in Chapter 2, an integrated circuit is mounted on a package that 
contains pins for connection to the outside world. 

Figure 5.4a shows an example EPROM package, which is an 8-Mbit chip orga- 
nized as 1M  X 8. In this case, the organization is treated as a one-word-per-chip 
package. The package includes 32 pins, which is one of the standard chip package 
sizes. The pins support the following signal lines: 


e The address of the word being accessed. For 1M words, a total of 20 (27? = 1M) 
pins are needed (A0-A19). 

e The data to be read out, consisting of 8 lines (DO-D7). 

e The power supply to the chip (V,,). 

e A ground pin (V,,). 

e A chip enable (CE) pin. Because there may be more than one memory chip, 


each of which is connected to the same address bus, the CE pin is used to indi- 
cate whether or not the address is valid for this chip. The CE pin is activated by 
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(a) 8-Mbit EPROM (b) 16-Mbit DRAM 
Figure 5.4 Typical Memory Package Pins and Signals 
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logic connected to the higher-order bits of the address bus (i.e., address bits 
above A19). The use of this signal is illustrated presently. 


e A program voltage (V,,) that is supplied during programming (write operations). 


A typical DRAM pin configuration is shown in Figure 5.4b, for a 16-Mbit chip 
organized as 4M xX 4. There are several differences from a ROM chip. Because a 
RAM can be updated, the data pins are input/output. The write enable (WE) and 
output enable (OE) pins indicate whether this is a write or read operation. Because 
the DRAM is accessed by row and column, and the address is multiplexed, only 11 
address pins are needed to specify the 4M row/column combinations (2!! x 2!! = 
2” = 4M). The functions of the row address select (RAS) and column address se- 


lect (CAS) pins were discussed previously. Finally, the no connect (NC) pin is pro- 
vided so that there are an even number of pins. 


Module Organization 


Ifa RAM chip contains only 1 bit per word, then clearly we will need at least a num- 
ber of chips equal to the number of bits per word. As an example, Figure 5.5 shows 
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Figure 5.6 1-Mbyte Memory Organization 


how a memory module consisting of 256K 8-bit words could be organized. For 256K 
words, an 18-bit address is needed and is supplied to the module from some external 
source (e.g., the address lines of a bus to which the module is attached). The address 
is presented to 8 256K X 1-bit chips, each of which provides the input/output of 1 bit. 

This organization works as long as the size of memory equals the number of 
bits per chip. In the case in which larger memory is required, an array of chips is 
needed. Figure 5.6 shows the possible organization of a memory consisting of 1M 
word by 8 bits per word. In this case, we have four columns of chips, each column 
containing 256K words arranged as in Figure 5.5. For 1M word, 20 address lines are 
needed. The 18 least significant bits are routed to all 32 modules. The high-order 
2 bits are input to a group select logic module that sends a chip enable signal to one 
of the four columns of modules. 
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Interleaved Memory 


Main memory is composed of a collection of DRAM memory chips. A number of chips 
can be grouped together to form a memory bank. It is possible to organize the memory 
banks in a way known as interleaved memory. Each bank is independently able to ser- 
vice a memory read or write request, so that a system with K banks can service K re- 
quests simultaneously, increasing memory read or write rates by a factor of K. If 
consecutive words of memory are stored in different banks, then the transfer of a block 
of memory is speeded up. Appendix E explores the topic of interleaved memory. 
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5.2 ERROR CORRECTION 


A semiconductor memory system is subject to errors. These can be categorized as 
hard failures and soft errors. A hard failure is a permanent physical defect so that 
the memory cell or cells affected cannot reliably store data but become stuck at 0 or 
1 or switch erratically between 0 and 1. Hard errors can be caused by harsh environ- 
mental abuse, manufacturing defects, and wear. A soft error is a random, nondestruc- 
tive event that alters the contents of one or more memory cells without damaging the 
memory. Soft errors can be caused by power supply problems or alpha particles. 
These particles result from radioactive decay and are distressingly common because 
radioactive nuclei are found in small quantities in nearly all materials. Both hard 
and soft errors are clearly undesirable, and most modern main memory systems in- 
clude logic for both detecting and correcting errors. 

Figure 5.7 illustrates in general terms how the process is carried out. When 
data are to be read into memory, a calculation, depicted as a function f, is performed 
on the data to produce a code. Both the code and the data are stored. Thus, if an 
M-bit word of data is to be stored and the code is of length K bits, then the actual 
size of the stored word is M + K bits. 

When the previously stored word is read out, the code is used to detect and pos- 
sibly correct errors. A new set of K code bits is generated from the M data bits and 
compared with the fetched code bits. The comparison yields one of three results: 


e No errors are detected. The fetched data bits are sent out. 


e An error is detected, and it is possible to correct the error. The data bits plus 
error correction bits are fed into a corrector, which produces a corrected set of 
M bits to be sent out. 


e Anerror is detected, but it is not possible to correct it. This condition is reported. 


Codes that operate in this fashion are referred to as error-correcting codes. A code 
is characterized by the number of bit errors in a word that it can correct and detect. 
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Figure 5.7 Error-Correcting Code Function 
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Figure 5.8 Hamming Error-Correcting Code 


The simplest of the error-correcting codes is the Hamming code devised by 
Richard Hamming at Bell Laboratories. Figure 5.8 uses Venn diagrams to illustrate 
the use of this code on 4-bit words (M = 4). With three intersecting circles, there are 
seven compartments. We assign the 4 data bits to the inner compartments (Fig- 
ure 5.8a). The remaining compartments are filled with what are called parity bits. 
Each parity bit is chosen so that the total number of 1s in its circle is even (Fig- 
ure 5.8b). Thus, because circle A includes three data 1s, the parity bit in that circle is 
set to 1. Now, if an error changes one of the data bits (Figure 5.8c), it is easily found. 
By checking the parity bits, discrepancies are found in circle A and circle C but not 
in circle B. Only one of the seven compartments is in A and C but not B. The error 
can therefore be corrected by changing that bit. 

To clarify the concepts involved, we will develop a code that can detect and 
correct single-bit errors in 8-bit words. 

To start, let us determine how long the code must be. Referring to Figure 5.7, 
the comparison logic receives as input two K-bit values. A bit-by-bit comparison is 
done by taking the exclusive-OR of the two inputs. The result is called the syndrome 
word. Thus, each bit of the syndrome is 0 or 1 according to if there is or is not a 
match in that bit position for the two inputs. 

The syndrome word is therefore K bits wide and has a range between 0 and 
2* — 1. The value 0 indicates that no error was detected, leaving 2 — 1 values to 
indicate, if there is an error, which bit was in error. Now, because an error could 
occur on any of the M data bits or K check bits, we must have 


9K -1=>M+K 
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Table 5.2 Increase in Word Length with Error Correction 


Single-Error Correction/ 
Single-Error Correction Double-Error Detection 


Data Bits Check Bits % Increase Check Bits % Increase 


8 
16 


50 62.5 
Sle) 


4 
5 
6 
7 
8 
9 


This inequality gives the number of bits needed to correct a single bit error in a word 
containing M data bits. For example, for a word of 8 data bits (M = 8), we have 


°K =3:2-1<8+3 
eK=4:2'-1>8+4 


Thus, eight data bits require four check bits. The first three columns of Table 5.2 lists 
the number of check bits required for various data word lengths. 

For convenience, we would like to generate a 4-bit syndrome for an 8-bit data 
word with the following characteristics: 


e If the syndrome contains all 0s, no error has been detected. 


e If the syndrome contains one and only one bit set to 1, then an error has oc- 
curred in one of the 4 check bits. No correction is needed. 


e If the syndrome contains more than one bit set to 1, then the numerical value 
of the syndrome indicates the position of the data bit in error. This data bit is 
inverted for correction. 


To achieve these characteristics, the data and check bits are arranged into a 
12-bit word as depicted in Figure 5.9. The bit positions are numbered from 1 to 12. 
Those bit positions whose position numbers are powers of 2 are designated as check 
bits. The check bits are calculated as follows, where the symbol @ designates the ex- 
clusive-OR operation: 


Cl = D106 D286 D46 DS5@ D7 

C2=D1e D3 D4@ D6 D7 

C4 = D26 D36 D4® D8 

C8 = D5® D6® D7@ D8 
ie 12 iat 10 9 8 7 6 5 4 3 2 1 
position 
Position | 4199 | 1011 | 1010 | 1001 | 1000 | o111 | 0110 | o101 | 0100 | oo11 | 0010 | 0001 
number 
Data bit ps | p7 | Do | Ds p4 | p3 | D2 Di 
Check bit C8 c4 @ | ei 


Figure 5.9 Layout of Data Bits and Check Bits 
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Each check bit operates on every data bit whose position number contains a 1 
in the same bit position as the position number of that check bit. Thus, data bit posi- 
tions 3,5, 7,9, and 11 (D1, D2, D4, D5, D7) all contain a 1 in the least significant bit 
of their position number as does C1; bit positions 3, 6, 7, 10, and 11 all contain a 1 in 
the second bit position, as does C2; and so on. Looked at another way, bit position 
is checked by those bits C; such that >}; = n. For example, position 7 is checked by 
bits in position 4,2,and1;and7 =4 +2 +1. 

Let us verify that this scheme works with an example. Assume that the 8-bit 
input word is 00111001, with data bit D1 in the rightmost position. The calculations 
are as follows: 


Cl =160616160=1 
C2 =160616160=1 
C4 = 00606160=1 
C8 = 1616060=0 
Suppose now that data bit 3 sustains an error and is changed from 0 to 1. When the 
check bits are recalculated, we have 
Cl =160616160=1 
C2=161616160=0 
C4 =00616160=0 
C8 = 1616060=0 
When the new check bits are compared with the old check bits, the syndrome word is 
formed: 
C8 C4 C2 Cl 
0 1 1 1 


® 0 0 O 1 
0 1 1 0 


The result is 0110, indicating that bit position 6, which contains data bit 3, is in error. 
Figure 5.10 illustrates the preceding calculation. The data and check bits are po- 
sitioned properly in the 12-bit word. Four of the data bits have a value 1 (shaded in the 


Bie 12 ll 10 9 8 7 6 5 4 3 2 1 
position 

Position 1100 | 1011 | 1010 | 1001 | 1000 | 0111 | 0110 | o101 | 0100 | 0011 | OO10 | O001 
number 

Data bit ps | p7 | De | Ds p4 | p3 | D2 DI 

Check bit cs C4 C2 Cl 
Word 0 0 1 1 0 1 0 0 1 1 1 1 
stored as 

Petts. | <a 0 1 1 0 1 1 0 1 1 1 1 
Position 1100 | 1011 | 1010 | 1001 | 1000 | 0111 | 0110 | 0101 | 0100 | 0011 | 0010 | 0001 
number 

Check bit 0 0 0 1 


Figure 5.10 Check Bit Calculation 
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(c) 


Figure 5.11 Hamming SEC-DEC Code 


table), and their bit position values are XORed to produce the Hamming code 0111, 
which forms the four check digits. The entire block that is stored is 001101001111. Sup- 
pose now that data bit 3, in bit position 6, sustains an error and is changed from 0 to 1. 
The resulting block is 001101101111, with a Hamming code of 0111. An XOR of the 
Hamming code and all of the bit position values for nonzero data bits results in 0110. 
The nonzero result detects an error and indicates that the error is in bit position 6. 

The code just described is known as a single-error-correcting (SEC) code. 
More commonly, semiconductor memory is equipped with a single-error-correcting, 
double-error-detecting (SEC-DED) code. As Table 5.2 shows, such codes require 
one additional bit compared with SEC codes. 

Figure 5.11 illustrates how such a code works, again with a 4-bit data word. The 
sequence shows that if two errors occur (Figure 5.11c), the checking procedure goes 
astray (d) and worsens the problem by creating a third error (e). To overcome the 
problem, an eighth bit is added that is set so that the total number of 1s in the dia- 
gram is even. The extra parity bit catches the error (f). 

An error-correcting code enhances the reliability of the memory at the cost of 
added complexity. With a 1-bit-per-chip organization, an SEC-DED code is generally 
considered adequate. For example, the IBM 30xx implementations used an 8-bit SEC- 
DED code for each 64 bits of data in main memory. Thus, the size of main memory is ac- 
tually about 12% larger than is apparent to the user. The VAX computers used a 7-bit 
SEC-DED for each 32 bits of memory, for a 22% overhead. A number of contempo- 
rary DRAMs use 9 check bits for each 128 bits of data, for a 7% overhead [SHAR97]. 


5.3 ADVANCED DRAM ORGANIZATION 


As discussed in Chapter 2, one of the most critical system bottlenecks when using 
high-performance processors is the interface to main internal memory. This interface 
is the most important pathway in the entire computer system. The basic building 
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Table 5.3 Performance Comparison of Some DRAM Alternatives 


Transfer Rate 
(GB/s) 


Clock Frequency 
(MHz) 


Access Time (ns) Pin Count 


168 
184 
162 


166 
200 
600 


1.3 
Be, 
4.8 


18 


1D 


block of main memory remains the DRAM chip, as it has for decades; until recently, 
there had been no significant changes in DRAM architecture since the early 1970s. 
The traditional DRAM chip is constrained both by its internal architecture and by its 
interface to the processor’s memory bus. 

We have seen that one attack on the performance problem of DRAM main 
memory has been to insert one or more levels of high-speed SRAM cache be- 
tween the DRAM main memory and the processor. But SRAM is much costlier 
than DRAM, and expanding cache size beyond a certain point yields diminishing 
returns. 

In recent years, a number of enhancements to the basic DRAM architecture 
have been explored, and some of these are now on the market. The schemes that 
currently dominate the market are SDRAM, DDR-DRAM, and RDRAM. Table 5.3 
provides a performance comparison. CDRAM has also received considerable atten- 
tion. We examine each of these approaches in this section. 


Synchronous DRAM 


One of the most widely used forms of DRAM is the synchronous DRAM 
(SDRAM) [VOGL94]. Unlike the traditional DRAM, which is asynchronous, the 
SDRAM exchanges data with the processor synchronized to an external clock sig- 
nal and running at the full speed of the processor/memory bus without imposing 
wait states. 

In a typical DRAM, the processor presents addresses and control levels to 
the memory, indicating that a set of data at a particular location in memory should 
be either read from or written into the DRAM. After a delay, the access time, the 
DRAM either writes or reads the data. During the access-time delay, the DRAM 
performs various internal functions, such as activating the high capacitance of the 
row and column lines, sensing the data, and routing the data out through the out- 
put buffers. The processor must simply wait through this delay, slowing system 
performance. 

With synchronous access, the DRAM moves data in and out under control of 
the system clock. The processor or other master issues the instruction and address 
information, which is latched by the DRAM. The DRAM then responds after a set 
number of clock cycles. Meanwhile, the master can safely do other tasks while the 
SDRAM is processing the request. 

Figure 5.12 shows the internal logic of IBM’s 64-Mb SDRAM [IBMO01], which 
is typical of SDRAM organization, and Table 5.4 defines the various pin assignments. 
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Table 5.4 SDRAM Pin Assignments 


AO to A13 Address inputs 


CLK Clock input 

CKE Clock enable 

cs Chip select 

RAS Row address strobe 
CAS Column address strobe 
WE Write enable 

DQ0 to DQ7 Data input/output 
DQM Data mask 


The SDRAM employs a burst mode to eliminate the address setup time and row and 
column line precharge time after the first access. In burst mode, a series of data bits 
can be clocked out rapidly after the first bit has been accessed. This mode is useful 
when all the bits to be accessed are in sequence and in the same row of the array as 
the initial access. In addition, the SDRAM has a multiple-bank internal architecture 
that improves opportunities for on-chip parallelism. 

The mode register and associated control logic is another key feature differen- 
tiating SDRAMs from conventional DRAMs. It provides a mechanism to customize 
the SDRAM to suit specific system needs. The mode register specifies the burst 
length, which is the number of separate units of data synchronously fed onto the 
bus. The register also allows the programmer to adjust the latency between receipt 
of a read request and the beginning of data transfer. 

The SDRAM performs best when it is transferring large blocks of data seri- 
ally, such as for applications like word processing, spreadsheets, and multimedia. 

Figure 5.13 shows an example of SDRAM operation. In this case, the burst 
length is 4 and the latency is 2. The burst read command is initiated by having CS 
and CAS low while holding RAS and WE high at the rising edge of the clock. The 
address inputs determine the starting column address for the burst, and the mode 
register sets the type of burst (sequential or interleave) and the burst length (1, 2, 4, 
8, full page). The delay from the start of the command to when the data from the 
first cell appears on the outputs is equal to the value of the CAS latency that is set in 
the mode register. 
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Figure 5.13 SDRAM Read Timing (burst length = 4, CAS latency = 2) 
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There is now an enhanced version of SDRAM, known as double data rate 
SDRAM (DDR-SDRAM) that overcomes the once-per-cycle limitation. DDR- 
SDRAM can send data to the processor twice per clock cycle. 


Rambus DRAM 


RDRAM, developed by Rambus [FARM92, CRIS97], has been adopted by Intel for 
its Pentium and Itanium processors. It has become the main competitor to SDRAM. 
RDRAM chips are vertical packages, with all pins on one side. The chip exchanges 
data with the processor over 28 wires no more than 12 centimeters long. The bus can 
address up to 320 RDRAM chips and is rated at 1.6 GBps. 

The special RDRAM bus delivers address and control information using an 
asynchronous block-oriented protocol. After an initial 480 ns access time, this pro- 
duces the 1.6 GBps data rate. What makes this speed possible is the bus itself, 
which defines impedances, clocking, and signals very precisely. Rather than being 
controlled by the explicit RAS, CAS, R/W, and CE signals used in conventional 
DRAMs, an RDRAM gets a memory request over the high-speed bus. This re- 
quest contains the desired address, the type of operation, and the number of bytes 
in the operation. 

Figure 5.14 illustrates the RDRAM layout. The configuration consists of a 
controller and a number of RDRAM modules connected via a common bus. The 
controller is at one end of the configuration, and the far end of the bus is a par- 
allel termination of the bus lines. The bus includes 18 data lines (16 actual data, 
two parity) cycling at twice the clock rate; that is, 1 bit is sent at the leading and 
following edge of each clock signal. This results in a signal rate on each data line 
of 800 Mbps. There is a separate set of 8 lines (RC) used for address and control 
signals. There is also a clock signal that starts at the far end from the controller 
propagates to the controller end and then loops back. A RDRAM module sends 
data to the controller synchronously to the clock to master, and the controller 
sends data to an RDRAM synchronously with the clock signal in the opposite 
direction. The remaining bus lines include a reference voltage, ground, and 
power source. 


Controller 
RDRAM 1 RDRAM2 | «e+ | RDRAMn 


INIT 


Bus data [18:0] 
RC [7:0] 


RCIk [2] 

TClk [2] 

Vref 

Gnd (32/18) (@) 
vd(4) Y 


Figure 5.14 RDRAM Structure 
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DDR SDRAM 


SDRAM is limited by the fact that it can only send data to the processor once per 
bus clock cycle. A new version of SDRAM, referred to as double-data-rate SDRAM 
can send data twice per clock cycle, once on the rising edge of the clock pulse and 
once on the falling edge. 

DDR DRAM was developed by the JEDEC Solid State Technology Associa- 
tion, the Electronic Industries Alliance’s semiconductor-engineering-standardization 
body. Numerous companies make DDR chips, which are widely used in desktop 
computers and servers. 

Figure 5.15 shows the basic timing for a DDR read. The data transfer is syn- 
chronized to both the rising and falling edge of the clock. It is also synchronized to a 
bidirectional data strobe (DQS) signal that is provided by the memory controller 
during a read and by the DRAM during a write. In typical implementations the 
DQS is ignored during the read. An explanation of the use of DOS on writes is 
beyond our scope; see [JACO08] for details. 


Clock 


RAS 


CAS 


Row 
Address address 


DQ Valid) Valid | 
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RAS = Row address select 
CAS = Column address select 
DQ = Data (in or out) 

DQS = DQ select 


Figure 5.15 DDR SDRAM Road Timing 
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There have been two generations of improvement to the DDR technology. 
DDR2 increases the data transfer rate by increasing the operational frequency of the 
RAM chip and by increasing the prefetch buffer from 2 bits to 4 bits per chip. The 
prefetch buffer is a memory cache located on the RAM chip. The buffer enables 
the RAM chip to preposition bits to be placed on the data base as rapidly as possi- 
ble. DDR3, introduced in 2007, increases the prefetch buffer size to 8 bits. 

Theoretically, a DDR module can transfer data at a clock rate in the range of 
200 to 600 MHz; a DDR2 module transfers at a clock rate of 400 to 1066 MHz; and 
a DDR3 module transfers at a clock rate of 800 to 1600 MHz. In practice, somewhat 
smaller rates are achieved. 

Appendix K provides more detail on DDR technology. 


Cache DRAM 


Cache DRAM (CDRAM), developed by Mitsubishi [HIDA90, ZHANO1], inte- 
grates a small SRAM cache (16 Kb) onto a generic DRAM chip. 

The SRAM on the CDRAM can be used in two ways. First, it can be used as a 
true cache, consisting of a number of 64-bit lines. The cache mode of the CDRAM is 
effective for ordinary random access to memory. 

The SRAM on the CDRAM can also be used as a buffer to support the serial 
access of a block of data. For example, to refresh a bit-mapped screen, the CDRAM 
can prefetch the data from the DRAM into the SRAM buffer. Subsequent accesses 
to the chip result in accesses solely to the SRAM. 


5.4 RECOMMENDED READING AND WEB SITES 


[PRIN97] provides a comprehensive treatment of semiconductor memory technologies, in- 
cluding SRAM, DRAM, and flash memories. [SHAR97] covers the same material, with more 
emphasis on testing and reliability issues. [SHARO3] and [PRINO2] focus on advanced 
DRAM and SRAM architectures. For an in-depth look at DRAM, see [JACO08] and 
[KEETO1]. [CUPPO1] provides an interesting performance comparison of various DRAM 
schemes. [BEZ03] is a comprehensive introduction to flash memory technology. 

A good explanation of error-correcting codes is contained in [MCELS85]. For a deeper 
study, worthwhile book-length treatments are [ADAM91] and [BLAH83]. A readable theo- 
retical and mathematical treatment of error-correcting codes is [ASH90]. [SHAR97] contains 
a good survey of codes used in contemporary main memories. 


ADAM91 = Adamek, J. Foundations of Coding. New York: Wiley, 1991. 
ASH90_ Ash, R. Information Theory. New York: Dover, 1990. 
BEZ03_ Bez,R.; et al. Introduction to Flash Memory. Proceedings of the IEEE, April 2003. 


BLAHS83 Blahut, R. Theory and Practice of Error Control Codes. Reading, MA: 
Addison-Wesley, 1983. 


CUPPO01 Cuppu, V., et al. “High Performance DRAMS in Workstation Environments.” 
IEEE Transactions on Computers, November 2001. 


JACO08 Jacob, B.; Ng, S.; and Wang, D. Memory Systems: Cache, DRAM, Disk. Boston: 
Morgan Kaufmann, 2008. 

KEETO1 Keeth, B., and Baker, R. DRAM Circuit Design: A Tutorial. Piscataway, NJ: 
IEEE Press, 2001. 
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MCEL85 McEliece, R. “The Reliability of Computer Memories.” Scientific American, 
January 1985. 


PRIN97 Prince, B. Semiconductor Memories. New York: Wiley, 1997. 

PRINO2 Prince, B. Emerging Memories: Technologies and Trends. Norwell, MA: Kluwer, 
2002. 

SHAR97_ Sharma, A. Semiconductor Memories: Technology, Testing, and Reliability. New 
York: IEEE Press, 1997. 


SHAR03 Sharma, A. Advanced Semiconductor Memories: Architectures, Designs, and 
Applications. New York: IEEE Press, 2003. 


Recommended Web sites: 


¢ The RAM Guide: Good overview of RAM technology plus a number of useful links 
¢ RDRAM: Another useful site for RDRAM information 


5.5 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 
Key Terms 


cache DRAM (CDRAM) hard failure single-error-correcting, 

dynamic RAM (DRAM) nonvolatile memory double-error-detecting 

electrically erasable program- programmable ROM (SEC-DED) code 
mable ROM (EEPROM) (PROM) soft error 

erasable programmable ROM | RamBus DRAM (RDRAM) static RAM (SRAM) 


(EPROM) read-mostly memory synchronous DRAM 
error correcting code (ECC) read-only memory (ROM) (SDRAM) 
error correction semiconductor memory syndrome 
flash memory single-error-correcting (SEC) volatile memory 
Hamming code code 


Review Questions 


5.1 What are the key properties of semiconductor memory? 

5.2 What are two senses in which the term random-access memory is used? 

5.3 What is the difference between DRAM and SRAM in terms of application? 

5.4 What is the difference between DRAM and SRAM in terms of characteristics such as 
speed, size, and cost? 

5.5 Explain why one type of RAM is considered to be analog and the other digital. 

5.6 What are some applications for ROM? 

5.7 What are the differences among EPROM, EEPROM, and flash memory? 

5.8 Explain the function of each pin in Figure 5.4b. 
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5.9 What is a parity bit? 
5.10 How is the syndrome for the Hamming code interpreted? 
5.11 How does SDRAM differ from ordinary DRAM? 


Problems 


5.1 Suggest reasons why RAMS traditionally have been organized as only 1 bit per chip 
whereas ROMs are usually organized with multiple bits per chip. 


5.2 Consider a dynamic RAM that must be given a refresh cycle 64 times per ms. Each re- 
fresh operation requires 150 ns; a memory cycle requires 250 ns. What percentage of 
the memory’s total operating time must be given to refreshes? 


5.3. Figure 5.16 shows a simplified timing diagram for a DRAM read operation over a 
bus. The access time is considered to last from ¢; to t7. Then there is a recharge time, 
lasting from f, to tz, during which the DRAM chips will have to recharge before the 
processor can access them again. 

a. Assume that the access time is 60 ns and the recharge time is 40 ns. What is the 
memory cycle time? What is the maximum data rate this DRAM can sustain, as- 
suming a 1-bit output? 

b. Constructing a 32-bit wide memory system using these chips yields what data 
transfer rate? 

5.4 Figure 5.6 indicates how to construct a module of chips that can store 1 MByte based 
on a group of four 256-Kbyte chips. Let’s say this module of chips is packaged as a 
single 1-Mbyte chip, where the word size is 1 byte. Give a high-level chip diagram of 
how to construct an 8-Mbyte computer memory using eight 1-Mbyte chips. Be sure to 
show the address lines in your diagram and what the address lines are used for. 

5.5 Ona typical Intel 8086-based system, connected via system bus to DRAM memory, 
for a read operation, RAS is activated by the trailing edge of the Address Enable sig- 
nal (Figure 3.19). However, due to propagation and other delays, RAS does not go 
active until 50 ns after Address Enable returns to a low. Assume the latter occurs in 
the middle of the second half of state T, (somewhat earlier than in Figure 3.19). Data 
are read by the processor at the end of T3. For timely presentation to the processor, 
however, data must be provided 60 ns earlier by memory. This interval accounts for 
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Figure 5.16 Simplified DRAM Read Timing 
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propagation delays along the data paths (from memory to processor) and processor 

data hold time requirements. Assume a clocking rate of 10 MHz. 

a. How fast (access time) should the DRAMs be if no wait states are to be inserted? 

b. How many wait states do we have to insert per memory read operation if the ac- 
cess time of the DRAMs is 150 ns? 


5.6 The memory of a particular microcomputer is built from 64K x 1 DRAMs. Accord- 
ing to the data sheet, the cell array of the DRAM is organized into 256 rows. Each 
row must be refreshed at least once every 4 ms. Suppose we refresh the memory on a 
strictly periodic basis. 
a. What is the time period between successive refresh requests? 
b. How long a refresh address counter do we need? 
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Figure 5.17 shows one of the early SRAMs, the 16 x 4 Signetics 7489 chip, which 

stores 16 4-bit words. a: 

a. List the mode of operation of the chip for each CS input pulse shown in 
Figure 5.17c. 

b. List the memory contents of word locations 0 through 6 after pulse n. 

c. What is the state of the output data leads for the input pulses h through m? 

Design a 16-bit memory of total capacity 8192 bits using SRAM chips of size 64 x 1 bit. 

Give the array configuration of the chips on the memory board showing all required 

input and output signals for assigning this memory to the lowest address space. The 

design should allow for both byte and 16-bit word accesses. 

A common unit of measure for failure rates of electronic components is the Failure 

unIT (FIT), expressed as a rate of failures per billion device hours. Another well 

known but less used measure is mean time between failures (MTBF), which is the av- 

erage time of operation of a particular component until it fails. Consider a 1 MB 

memory of a 16-bit microprocessor with 256K X 1 DRAMs. Calculate its MTBF as- 

suming 2000 FITS for each DRAM. 

For the Hamming code shown in Figure 5.10, show what happens when a check bit 

rather than a data bit is in error? 

Suppose an 8-bit data word stored in memory is 11000010. Using the Hamming algo- 

rithm, determine what check bits would be stored in memory with the data word. 

Show how you got your answer. 

For the 8-bit word 00111001, the check bits stored with it would be 0111. Suppose 

when the word is read from memory, the check bits are calculated to be 1101. What is 

the data word that was read from memory? 

How many check bits are needed if the Hamming error correction code is used to de- 

tect single bit errors in a 1024-bit data word? 

Develop an SEC code for a 16-bit data word. Generate the code for the data word 

0101000000111001. Show that the code will correctly identify an error in data bit 5. 
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KEY POINTS 


® Magnetic disks remain the most important component of external memory. 
Both removable and fixed, or hard, disks are used in systems ranging from 
personal computers to mainframes and supercomputers. 


® ‘To achieve greater performance and higher availability, servers and larger 
systems use RAID disk technology. RAID is a family of techniques for 
using multiple disks as a parallel array of data storage devices, with redun- 
dancy built in to compensate for disk failure. 


® Optical storage technology has become increasingly important in all types 
of computer systems. While CD-ROM has been widely used for many 
years, more recent technologies, such as writable CD and DVD, are becom- 
ing increasingly important. 


This chapter examines a range of external memory devices and systems. We begin with 
the most important device, the magnetic disk. Magnetic disks are the foundation of ex- 
ternal memory on virtually all computer systems. The next section examines the use of 
disk arrays to achieve greater performance, looking specifically at the family of systems 
known as RAID (Redundant Array of Independent Disks). An increasingly important 
component of many computer systems is external optical memory, and this is examined 
in the third section. Finally, magnetic tape is described. 


6.1 MAGNETIC DISK 


A disk is a circular platter constructed of nonmagnetic material, called the substrate, 
coated with a magnetizable material. Traditionally, the substrate has been an alu- 
minum or aluminum alloy material. More recently, glass substrates have been intro- 
duced. The glass substrate has a number of benefits, including the following: 


e Improvement in the uniformity of the magnetic film surface to increase disk 
reliability 

e A significant reduction in overall surface defects to help reduce read-write errors 

e Ability to support lower fly heights (described subsequently) 

e Better stiffness to reduce disk dynamics 

e Greater ability to withstand shock and damage 


Magnetic Read and Write Mechanisms 


Data are recorded on and later retrieved from the disk via a conducting coil named 
the head; in many systems, there are two heads, a read head and a write head. During 
a read or write operation, the head is stationary while the platter rotates beneath it. 

The write mechanism exploits the fact that electricity flowing through a coil 
produces a magnetic field. Electric pulses are sent to the write head, and the resulting 
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Figure 6.1 Inductive Write/Magnetoresistive Read Head 


magnetic patterns are recorded on the surface below, with different patterns for pos- 
itive and negative currents. The write head itself is made of easily magnetizable ma- 
terial and is in the shape of a rectangular doughnut with a gap along one side and a 
few turns of conducting wire along the opposite side (Figure 6.1). An electric current 
in the wire induces a magnetic field across the gap, which in turn magnetizes a small 
area of the recording medium. Reversing the direction of the current reverses the di- 
rection of the magnetization on the recording medium. 

The traditional read mechanism exploits the fact that a magnetic field moving 
relative to a coil produces an electrical current in the coil. When the surface of the disk 
passes under the head, it generates a current of the same polarity as the one already 
recorded. The structure of the head for reading is in this case essentially the same as 
for writing and therefore the same head can be used for both. Such single heads are 
used in floppy disk systems and in older rigid disk systems. 

Contemporary rigid disk systems use a different read mechanism, requiring a 
separate read head, positioned for convenience close to the write head. The read 
head consists of a partially shielded magnetoresistive (MR) sensor. The MR material 
has an electrical resistance that depends on the direction of the magnetization of the 
medium moving under it. By passing a current through the MR sensor, resistance 
changes are detected as voltage signals. The MR design allows higher-frequency 
operation, which equates to greater storage densities and operating speeds. 


Data Organization and Formatting 


The head is a relatively small device capable of reading from or writing to a portion 
of the platter rotating beneath it. This gives rise to the organization of data on the 
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Figure 6.2 Disk Data Layout 


platter in a concentric set of rings, called tracks. Each track is the same width as the 
head. There are thousands of tracks per surface. 

Figure 6.2 depicts this data layout. Adjacent tracks are separated by gaps. This 
prevents, or at least minimizes, errors due to misalignment of the head or simply 
interference of magnetic fields. 

Data are transferred to and from the disk in sectors (Figure 6.2). There are 
typically hundreds of sectors per track, and these may be of either fixed or variable 
length. In most contemporary systems, fixed-length sectors are used, with 512 bytes 
being the nearly universal sector size. To avoid imposing unreasonable precision 
requirements on the system, adjacent sectors are separated by intratrack (intersec- 
tor) gaps. 

A bit near the center of a rotating disk travels past a fixed point (such as a 
read—write head) slower than a bit on the outside. Therefore, some way must be 
found to compensate for the variation in speed so that the head can read all the bits 
at the same rate. This can be done by increasing the spacing between bits of informa- 
tion recorded in segments of the disk. The information can then be scanned at the 
same rate by rotating the disk at a fixed speed, known as the constant angular veloc- 
ity (CAV). Figure 6.3a shows the layout of a disk using CAV. The disk is divided into 
a number of pie-shaped sectors and into a series of concentric tracks. The advantage 
of using CAV is that individual blocks of data can be directly addressed by track and 
sector. To move the head from its current location to a specific address, it only takes 
a short movement of the head to a specific track and a short wait for the proper sec- 
tor to spin under the head. The disadvantage of CAV is that the amount of data that 
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(a) Constant angular velocity (b) Multiple zoned recording 


Figure 6.3 Comparison of Disk Layout Methods 


can be stored on the long outer tracks is the only same as what can be stored on the 
short inner tracks. 

Because the density, in bits per linear inch, increases in moving from the out- 
ermost track to the innermost track, disk storage capacity in a straightforward CAV 
system is limited by the maximum recording density that can be achieved on the in- 
nermost track. To increase density, modern hard disk systems use a technique 
known as multiple zone recording, in which the surface is divided into a number of 
concentric zones (16 is typical). Within a zone, the number of bits per track is con- 
stant. Zones farther from the center contain more bits (more sectors) than zones 
closer to the center. This allows for greater overall storage capacity at the expense of 
somewhat more complex circuitry. As the disk head moves from one zone to an- 
other, the length (along the track) of individual bits changes, causing a change in the 
timing for reads and writes. Figure 6.3b suggests the nature of multiple zone record- 
ing; in this illustration, each zone is only a single track wide. 

Some means is needed to locate sector positions within a track. Clearly, there 
must be some starting point on the track and a way of identifying the start and end 
of each sector. These requirements are handled by means of control data recorded 
on the disk. Thus, the disk is formatted with some extra data used only by the disk 
drive and not accessible to the user. 

An example of disk formatting is shown in Figure 6.4. In this case, each track 
contains 30 fixed-length sectors of 600 bytes each. Each sector holds 512 bytes of 
data plus control information useful to the disk controller. The ID field is a unique 
identifier or address used to locate a particular sector. The SYNCH byte is a special 
bit pattern that delimits the beginning of the field. The track number identifies a 
track on a surface. The head number identifies a head, because this disk has multi- 
ple surfaces (explained presently). The ID and data fields each contain an error- 
detecting code. 


Physical Characteristics 


Table 6.1 lists the major characteristics that differentiate among the various types of 
magnetic disks. First, the head may either be fixed or movable with respect to the ra- 
dial direction of the platter. In a fixed-head disk, there is one read-write head per 
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Figure 6.4 Winchester Disk Format (Seagate ST506) 


track. All of the heads are mounted on a rigid arm that extends across all tracks; 
such systems are rare today. In a movable-head disk, there is only one read-write 
head. Again, the head is mounted on an arm. Because the head must be able to be 
positioned above any track, the arm can be extended or retracted for this purpose. 

The disk itself is mounted in a disk drive, which consists of the arm, a spindle 
that rotates the disk, and the electronics needed for input and output of binary data. 
A nonremovable disk is permanently mounted in the disk drive; the hard disk in a 
personal computer is a nonremovable disk. A removable disk can be removed and 
replaced with another disk. The advantage of the latter type is that unlimited 
amounts of data are available with a limited number of disk systems. Furthermore, 
such a disk may be moved from one computer system to another. Floppy disks and 
ZIP cartridge disks are examples of removable disks. 

For most disks, the magnetizable coating is applied to both sides of the platter, 
which is then referred to as double sided. Some less expensive disk systems use 
single-sided disks. 


Table 6.1 Physical Characteristics of Disk Systems 


Head Motion Platters 
Fixed head (one per track) Single platter 
Movable head (one per surface) Multiple platter 


Disk Portability Head Mechanism 
Nonremovable disk Contact (floppy) 
Removable disk Fixed gap 


Aerodynamic gap (Winchester) 


Sides 
Single sided 
Double sided 
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Figure 6.5 Components of a Disk Drive 


Some disk drives accommodate multiple platters stacked vertically a fraction 
of an inch apart. Multiple arms are provided (Figure 6.5). Multiple—platter disks em- 
ploy a movable head, with one read-write head per platter surface. All of the heads 
are mechanically fixed so that all are at the same distance from the center of the 
disk and move together. Thus, at any time, all of the heads are positioned over tracks 
that are of equal distance from the center of the disk. The set of all the tracks in the 
same relative position on the platter is referred to as a cylinder. For example, all of 
the shaded tracks in Figure 6.6 are part of one cylinder. 

Finally, the head mechanism provides a classification of disks into three types. 
Traditionally, the read-write head has been positioned a fixed distance above the 
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Figure 6.6 Tracks and Cylinders 


6.1 / MAGNETIC DISK 191 


platter, allowing an air gap. At the other extreme is a head mechanism that actually 
comes into physical contact with the medium during a read or write operation. This 
mechanism is used with the floppy disk, which is a small, flexible platter and the 
least expensive type of disk. 

To understand the third type of disk, we need to comment on the relationship 
between data density and the size of the air gap. The head must generate or sense 
an electromagnetic field of sufficient magnitude to write and read properly. The 
narrower the head is, the closer it must be to the platter surface to function. A nar- 
rower head means narrower tracks and therefore greater data density, which is de- 
sirable. However, the closer the head is to the disk, the greater the risk of error 
from impurities or imperfections. To push the technology further, the Winchester 
disk was developed. Winchester heads are used in sealed drive assemblies that are 
almost free of contaminants. They are designed to operate closer to the disk’s sur- 
face than conventional rigid disk heads, thus allowing greater data density. The 
head is actually an aerodynamic foil that rests lightly on the platter’s surface when 
the disk is motionless. The air pressure generated by a spinning disk is enough to 
make the foil rise above the surface. The resulting noncontact system can be engi- 
neered to use narrower heads that operate closer to the platter’s surface than con- 
ventional rigid disk heads.! 

Table 6.2 gives disk parameters for typical contemporary high-performance 
disks. 


Table 6.2 Typical Hard Disk Drive Parameters 


Hitachi 


Characteristics 


Seagate 
Barracuda 
ES.2 


Seagate 
Barracuda 
7200.10 


Seagate 
Barracuda 
7200.9 


Seagate 


Micro- 
drive 


Application 


High-capacity 
server 


High-performance 
desktop 


Entry-level 
desktop 


Laptop 


Handheld 
devices 


Capacity 


1TB 


750 GB 


160 GB 


120 GB 


8 GB 


Minimum track-to-track 
seek time 


0.8 ms 


0.3 ms 


1.0 ms 


1.0 ms 


Average seek time 


8.5 ms 


3.6 ms 


9.5 ms 


12.5 ms 


12 ms 


Spindle speed 


7200 rpm 


7200 rpm 


7200 


5400 rpm 


3600 rpm 


Average rotational delay 


4.16 ms 


4.16 ms 


4.17 ms 


5.6 ms 


8.33 ms 


Maximum transfer rate 


3 GB/s 


300 MB/s 


300 MB/s 


150 MB/s 


10 MB/s 


Bytes per sector 


512 


512 


aly 


512 


512 


Tracks per cylinder (num- 
ber of platter surfaces) 


8 


8 


2 


8 


2 


‘As a matter of historical interest, the term Winchester was originally used by IBM as a code name for the 
3340 disk model prior to its announcement. The 3340 was a removable disk pack with the heads sealed 
within the pack. The term is now applied to any sealed-unit disk drive with aerodynamic head design. The 
Winchester disk is commonly found built in to personal computers and workstations, where it is referred 
to as a hard disk. 
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Figure 6.7 Timing of a Disk I/O Transfer 


Disk Performance Parameters 


The actual details of disk I/O operation depend on the computer system, the operat- 
ing system, and the nature of the I/O channel and disk controller hardware. A gen- 
eral timing diagram of disk I/O transfer is shown in Figure 6.7. 

When the disk drive is operating, the disk is rotating at constant speed. To read 
or write, the head must be positioned at the desired track and at the beginning of the 
desired sector on that track. Track selection involves moving the head in a movable- 
head system or electronically selecting one head on a fixed-head system. On a movable- 
head system, the time it takes to position the head at the track is known as seek 
time. In either case, once the track is selected, the disk controller waits until the 
appropriate sector rotates to line up with the head. The time it takes for the beginning 
of the sector to reach the head is known as rotational delay, or rotational latency. 
The sum of the seek time, if any, and the rotational delay equals the access time, 
which is the time it takes to get into position to read or write. Once the head is in po- 
sition, the read or write operation is then performed as the sector moves under the 
head; this is the data transfer portion of the operation; the time required for the 
transfer is the transfer time. 

In addition to the access time and transfer time, there are several queuing 
delays normally associated with a disk I/O operation. When a process issues an I/O 
request, it must first wait in a queue for the device to be available. At that time, the 
device is assigned to the process. If the device shares a single I/O channel or a set of 
I/O channels with other disk drives, then there may be an additional wait for the 
channel to be available. At that point, the seek is performed to begin disk access. 

In some high-end systems for servers, a technique known as rotational posi- 
tional sensing (RPS) is used. This works as follows: When the seek command has 
been issued, the channel is released to handle other I/O operations. When the seek 
is completed, the device determines when the data will rotate under the head. As 
that sector approaches the head, the device tries to reestablish the communication 
path back to the host. If either the control unit or the channel is busy with another 
1/O, then the reconnection attempt fails and the device must rotate one whole revo- 
lution before it can attempt to reconnect, which is called an RPS miss. This is an 
extra delay element that must be added to the timeline of Figure 6.7. 


SEEK TIME Seek time is the time required to move the disk arm to the required 
track. It turns out that this is a difficult quantity to pin down. The seek time consists 
of two key components: the initial startup time, and the time taken to traverse the 
tracks that have to be crossed once the access arm is up to speed. Unfortunately, the 
traversal time is not a linear function of the number of tracks, but includes a settling 
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time (time after positioning the head over the target track until track identification 
is confirmed). 

Much improvement comes from smaller and lighter disk components. Some 
years ago, a typical disk was 14 inches (36 cm) in diameter, whereas the most com- 
mon size today is 3.5 inches (8.9 cm), reducing the distance that the arm has to 
travel. A typical average seek time on contemporary hard disks is under 10 ms. 


ROTATIONAL DELAY Disks, other than floppy disks, rotate at speeds ranging from 
3600 rpm (for handheld devices such as digital cameras) up to, as of this writing, 
20,000 rpm; at this latter speed, there is one revolution per 3 ms. Thus, on the aver- 
age, the rotational delay will be 1.5 ms. 


TRANSFER TIME The transfer time to or from the disk depends on the rotation 
speed of the disk in the following fashion: 


eee 
rN 
where 
T = transfer time 
b = number of bytes to be transferred 
N 


r 


number of bytes on a track 


rotation speed, in revolutions per second 
Thus the total average access time can be expressed as 
1 b 


T= T+ + 
e ° 2r rN 


where T, is the average seek time. Note that on a zoned drive, the number of bytes 
per track is variable, complicating the calculation.” 


A TIMING COMPARISON With the foregoing parameters defined, let us look at two 
different I/O operations that illustrate the danger of relying on average values. Con- 
sider a disk with an advertised average seek time of 4 ms, rotation speed of 15,000 rpm, 
and 512-byte sectors with 500 sectors per track. Suppose that we wish to read a file 
consisting of 2500 sectors for a total of 1.28 Mbytes. We would like to estimate the 
total time for the transfer. 

First, let us assume that the file is stored as compactly as possible on the disk. 
That is, the file occupies all of the sectors on 5 adjacent tracks (5 tracks x 500 sectors/ 
track = 2500 sectors). This is known as sequential organization. Now, the time to 
read the first track is as follows: 


Average seek 4 ms 
Average rotational delay 2ms 
Read 500 sectors 4 ms 

10 ms 


?Compare the two preceding equations to Equation (4.1). 
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Suppose that the remaining tracks can now be read with essentially no seek 
time. That is, the I/O operation can keep up with the flow from the disk. Then, at 
most, we need to deal with rotational delay for each succeeding track. Thus each 
successive track is read in 2 + 4 = 6 ms.To read the entire file, 


Total time = 10 + (4 X 6) = 34 ms = 0.034 seconds 


Now let us calculate the time required to read the same data using random 
access rather than sequential access; that is, accesses to the sectors are distributed 
randomly over the disk. For each sector, we have 


Average seek 4 ms 
Rotational delay 2 ms 
Read 1 sectors 0.008 ms 

6.008 ms 


Total time = 2500 < 6.008 = 15020 ms = 15.02 seconds 


It is clear that the order in which sectors are read from the disk has a tremen- 
dous effect on I/O performance. In the case of file access in which multiple sectors 
are read or written, we have some control over the way in which sectors of data are 
deployed. However, even in the case of a file access, in a multiprogramming environ- 
ment, there will be I/O requests competing for the same disk. Thus, it is worthwhile 
to examine ways in which the performance of disk I/O can be improved over that 
achieved with purely random access to the disk. This leads to a consideration of disk 
scheduling algorithms, which is the province of the operating system and beyond the 
scope of this book (see [STAL09] for a discussion). 


RAID Simulator 


6.2 RAID 


As discussed earlier, the rate in improvement in secondary storage performance has 
been considerably less than the rate for processors and main memory. This mis- 
match has made the disk storage system perhaps the main focus of concern in im- 
proving overall computer system performance. 

As in other areas of computer performance, disk storage designers recognize 
that if one component can only be pushed so far, additional gains in performance 
are to be had by using multiple parallel components. In the case of disk storage, 
this leads to the development of arrays of disks that operate independently and in 
parallel. With multiple disks, separate I/O requests can be handled in parallel, as 
long as the data required reside on separate disks. Further, a single I/O request 
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can be executed in parallel if the block of data to be accessed is distributed across 
multiple disks. 

With the use of multiple disks, there is a wide variety of ways in which the data 
can be organized and in which redundancy can be added to improve reliability. This 
could make it difficult to develop database schemes that are usable on a number of 
platforms and operating systems. Fortunately, industry has agreed on a standardized 
scheme for multiple-disk database design, known as RAID (Redundant Array of 
Independent Disks). The RAID scheme consists of seven levels,* zero through six. 
These levels do not imply a hierarchical relationship but designate different design 
architectures that share three common characteristics: 


1. RAID is a set of physical disk drives viewed by the operating system as a sin- 
gle logical drive. 


2. Data are distributed across the physical drives of an array in a scheme known as 
striping, described subsequently. 


3. Redundant disk capacity is used to store parity information, which guarantees 
data recoverability in case of a disk failure. 


The details of the second and third characteristics differ for the different RAID lev- 
els. RAID 0 and RAID 1 do not support the third characteristic. 

The term RAID was originally coined in a paper by a group of researchers at 
the University of California at Berkeley [PATT88].* The paper outlined various 
RAID configurations and applications and introduced the definitions of the RAID 
levels that are still used. The RAID strategy employs multiple disk drives and dis- 
tributes data in such a way as to enable simultaneous access to data from multiple 
drives, thereby improving I/O performance and allowing easier incremental in- 
creases in capacity. 

The unique contribution of the RAID proposal is to address effectively the 
need for redundancy. Although allowing multiple heads and actuators to operate 
simultaneously achieves higher I/O and transfer rates, the use of multiple devices 
increases the probability of failure. To compensate for this decreased reliability, 
RAID makes use of stored parity information that enables the recovery of data lost 
due to a disk failure. 

We now examine each of the RAID levels. Table 6.3 provides a rough guide to 
the seven levels. In the table, I/O performance is shown both in terms of data trans- 
fer capacity, or ability to move data, and I/O request rate, or ability to satisfy I/O re- 
quests, since these RAID levels inherently perform differently relative to these two 


3 Additional levels have been defined by some researchers and some companies, but the seven levels 
described in this section are the ones universally agreed on. 


‘In that paper, the acronym RAID stood for Redundant Array of Inexpensive Disks. The term 
inexpensive was used to contrast the small relatively inexpensive disks in the RAID array to the alterna- 
tive, a single large expensive disk (SLED). The SLED is essentially a thing of the past, with similar disk 
technology being used for both RAID and non-RAID configurations. Accordingly, the industry has 
adopted the term independent to emphasize that the RAID array creates significant performance and 
reliability gains. 
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Category 


6.3 RAID Levels 


Description 


Disks 
Required 


Data Availability 


Large I/O Data 
Transfer Capacity 


Small 1/O 
Request Rate 


Striping 


Nonredundant 


N 


Lower than 
single disk 


Very high 


Very high for both read 
and write 


Mirroring 


Mirrored 


Higher than RAID 2, 
3,4, or 5; lower than 
RAID 6 


Higher than single disk 
for read; similar to sin- 
gle disk for write 


Up to twice that of a 
single disk for read; 
similar to single disk 
for write 


Parallel access 


Redundant via Ham- 
ming code 


Much higher than single 
disk; comparable to 
RAID 3,4, or 5 


Highest of all listed 
alternatives 


Approximately twice 
that of a single disk 


Bit-interleaved parity 


Much higher than single 
disk; comparable to 
RAID 2,4, or 5 


Highest of all listed 
alternatives 


Approximately twice 
that of a single disk 


Independent 
access 


Block-interleaved 
parity 


Much higher than single 
disk; comparable to 
RAID 2, 3, or 5 


Similar to RAID 0 for 
read; significantly lower 
than single disk for 
write 


Similar to RAID 0 for 
read; significantly lower 
than single disk for 
write 


Block-interleaved 
distributed parity 


Much higher than single 
disk; comparable to 
RAID 2, 3, or 4 


Similar to RAID 0 for 
read; lower than single 
disk for write 


Similar to RAID 0 for 
read; generally lower 
than single disk for 
write 


Block-interleaved 
dual distributed 


parity 


N = number of data disks; m proportional to log N 


Highest of all listed 
alternatives 


Similar to RAID 0 for 
read; lower than RAID 
5 for write 


Similar to RAID 0 for 
read; significantly lower 
than RAID 5 for write 


(a) RAID 0 (Nonredundant) 
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(b) RAID 1 (Mirrored) 
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(c) RAID 2 (Redundancy through Hamming code) 
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Figure 6.8 RAID Levels 
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a 


metrics. Each RAID level’s strong point is highlighted by darker shading. Figure 6.8 
illustrates the use of the seven RAID schemes to support a data capacity requiring 
four disks with no redundancy. The figures highlight the layout of user data and re- 
dundant data and indicates the relative storage requirements of the various levels. 


We refer to these figures throughout the following discussion. 


RAID Level 0 


RAID level 0 is not a true member of the RAID family because it does not include 
redundancy to improve performance. However, there are a few applications, such as 
some on supercomputers in which performance and capacity are primary concerns 
and low cost is more important than improved reliability. 

For RAID 0, the user and system data are distributed across all of the disks in 
the array. This has a notable advantage over the use of a single large disk: If two 
different I/O requests are pending for two different blocks of data, then there is a 
good chance that the requested blocks are on different disks. Thus, the two requests 
can be issued in parallel, reducing the I/O queuing time. 

But RAID 0, as with all of the RAID levels, goes further than simply distribut- 
ing the data across a disk array: The data are striped across the available disks. This is 
best understood by considering Figure 6.9. All of the user and system data are viewed 


198 CHAPTER 6 / EXTERNAL MEMORY 


(d) RAID 3 (Bit-interleaved parity) 
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(e) RAID 4 (Block-level parity) 
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(g) RAID 6 (Dual redundancy) 
Figure 6.8 RAID Levels (continued ) 


as being stored on a logical disk. The logical disk is divided into strips; these strips 
may be physical blocks, sectors, or some other unit. The strips are mapped round 
robin to consecutive physical disks in the RAID array. A set of logically consecutive 
strips that maps exactly one strip to each array member is referred to as a stripe. In 
an n-disk array, the first n logical strips are physically stored as the first strip on each 
of the n disks, forming the first stripe; the second n strips are distributed as the second 
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Figure 6.9 Data Mapping for a RAID Level 0 Array 


strips on each disk; and so on. The advantage of this layout is that if a single I/O re- 
quest consists of multiple logically contiguous strips, then up to n strips for that re- 
quest can be handled in parallel, greatly reducing the I/O transfer time. 

Figure 6.9 indicates the use of array management software to map between 
logical and physical disk space. This software may execute either in the disk subsys- 
tem or in a host computer. 


RAID 0 FOR HIGH DATA TRANSFER CAPACITY The performance of any of the 
RAID levels depends critically on the request patterns of the host system and 
on the layout of the data. These issues can be most clearly addressed in RAID 0, 
where the impact of redundancy does not interfere with the analysis. First, let us 
consider the use of RAID 0 to achieve a high data transfer rate. For applications to 
experience a high transfer rate, two requirements must be met. First, a high transfer 
capacity must exist along the entire path between host memory and the individual 
disk drives. This includes internal controller buses, host system I/O buses, I/O 
adapters, and host memory buses. 

The second requirement is that the application must make I/O requests that 
drive the disk array efficiently. This requirement is met if the typical request is for 
large amounts of logically contiguous data, compared to the size of a strip. In this 
case, a single I/O request involves the parallel transfer of data from multiple disks, 
increasing the effective transfer rate compared to a single-disk transfer. 
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RAID 0 FOR HIGH I/O REQUEST RATE In a transaction-oriented environment, 
the user is typically more concerned with response time than with transfer rate. For an 
individual I/O request for a small amount of data, the I/O time is dominated by the mo- 
tion of the disk heads (seek time) and the movement of the disk (rotational latency). 

In a transaction environment, there may be hundreds of I/O requests per sec- 
ond. A disk array can provide high I/O execution rates by balancing the I/O load 
across multiple disks. Effective load balancing is achieved only if there are typically 
multiple I/O requests outstanding. This, in turn, implies that there are multiple inde- 
pendent applications or a single transaction-oriented application that is capable of 
multiple asynchronous I/O requests. The performance will also be influenced by the 
strip size. If the strip size is relatively large, so that a single I/O request only involves 
a single disk access, then multiple waiting I/O requests can be handled in parallel, 
reducing the queuing time for each request. 


RAID Level 1 


RAID 1 differs from RAID levels 2 through 6 in the way in which redundancy is 
achieved. In these other RAID schemes, some form of parity calculation is used to 
introduce redundancy, whereas in RAID 1, redundancy is achieved by the simple 
expedient of duplicating all the data. As Figure 6.8b shows, data striping is used, as in 
RAID 0. But in this case, each logical strip is mapped to two separate physical disks 
so that every disk in the array has a mirror disk that contains the same data. RAID 
1 can also be implemented without data striping, though this is less common. 
There are a number of positive aspects to the RAID 1 organization: 


1. A read request can be serviced by either of the two disks that contains the 
requested data, whichever one involves the minimum seek time plus rotational 
latency. 


2. A write request requires that both corresponding strips be updated, but this can 
be done in parallel. Thus, the write performance is dictated by the slower of the 
two writes (i.e., the one that involves the larger seek time plus rotational latency). 
However, there is no “write penalty” with RAID 1. RAID levels 2 through 6 in- 
volve the use of parity bits. Therefore, when a single strip is updated, the array 
management software must first compute and update the parity bits as well as 
updating the actual strip in question. 


3. Recovery from a failure is simple. When a drive fails, the data may still be ac- 
cessed from the second drive. 


The principal disadvantage of RAID 1 is the cost; it requires twice the disk 
space of the logical disk that it supports. Because of that, a RAID 1 configuration is 
likely to be limited to drives that store system software and data and other highly 
critical files. In these cases, RAID 1 provides real-time copy of all data so that in the 
event of a disk failure, all of the critical data are still immediately available. 

In a transaction-oriented environment, RAID 1 can achieve high I/O request 
rates if the bulk of the requests are reads. In this situation, the performance of 
RAID 1 can approach double of that of RAID 0. However, if a substantial fraction 
of the I/O requests are write requests, then there may be no significant performance 
gain over RAID 0. RAID 1 may also provide improved performance over RAID 0 
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for data transfer intensive applications with a high percentage of reads. Improve- 
ment occurs if the application can split each read request so that both disk mem- 
bers participate. 


RAID Level 2 


RAID levels 2 and 3 make use of a parallel access technique. In a parallel access 
array, all member disks participate in the execution of every I/O request. Typically, 
the spindles of the individual drives are synchronized so that each disk head is in the 
same position on each disk at any given time. 

As in the other RAID schemes, data striping is used. In the case of RAID 2 
and 3, the strips are very small, often as small as a single byte or word. With RAID 
2, an error-correcting code is calculated across corresponding bits on each data disk, 
and the bits of the code are stored in the corresponding bit positions on multiple 
parity disks. Typically, a Hamming code is used, which is able to correct single-bit 
errors and detect double-bit errors. 

Although RAID 2 requires fewer disks than RAID 1, it is still rather costly. 
The number of redundant disks is proportional to the log of the number of data 
disks. On a single read, all disks are simultaneously accessed. The requested data 
and the associated error-correcting code are delivered to the array controller. If 
there is a single-bit error, the controller can recognize and correct the error in- 
stantly, so that the read access time is not slowed. On a single write, all data disks 
and parity disks must be accessed for the write operation. 

RAID 2 would only be an effective choice in an environment in which many 
disk errors occur. Given the high reliability of individual disks and disk drives, 
RAID 2 is overkill and is not implemented. 


RAID Level 3 


RAID 3 is organized in a similar fashion to RAID 2. The difference is that RAID 3 
requires only a single redundant disk, no matter how large the disk array. RAID 3 
employs parallel access, with data distributed in small strips. Instead of an error-cor- 
recting code, a simple parity bit is computed for the set of individual bits in the same 
position on all of the data disks. 


REDUNDANCY In the event of a drive failure, the parity drive is accessed and data is 
reconstructed from the remaining devices. Once the failed drive is replaced, the 
missing data can be restored on the new drive and operation resumed. 

Data reconstruction is simple. Consider an array of five drives in which XO 
through X3 contain data and X4 is the parity disk. The parity for the ith bit is calculated 
as follows: 


X4(i) = X3(i) © X2(i) ® X1(i) © X0(i) 


where @ is exclusive-OR function. 
Suppose that drive X1 has failed. If we add X4(i) © X1(Z) to both sides of the 
preceding equation, we get 


X1(i) = X4(i) © X3(i) © X2(i) ® X0(i) 
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Thus, the contents of each strip of data on X1 can be regenerated from the contents 
of the corresponding strips on the remaining disks in the array. This principle is true 
for RAID levels 3 through 6. 

In the event of a disk failure, all of the data are still available in what is re- 
ferred to as reduced mode. In this mode, for reads, the missing data are regenerated 
on the fly using the exclusive-OR calculation. When data are written to a reduced 
RAID 3 array, consistency of the parity must be maintained for later regeneration. 
Return to full operation requires that the failed disk be replaced and the entire con- 
tents of the failed disk be regenerated on the new disk. 


PERFORMANCE Because data are striped in very small strips, RAID 3 can achieve 
very high data transfer rates. Any I/O request will involve the parallel transfer of 
data from all of the data disks. For large transfers, the performance improvement is 
especially noticeable. On the other hand, only one I/O request can be executed at a 
time. Thus, in a transaction-oriented environment, performance suffers. 


RAID Level 4 


RAID levels 4 through 6 make use of an independent access technique. In an inde- 
pendent access array, each member disk operates independently, so that separate 
1/O requests can be satisfied in parallel. Because of this, independent access arrays 
are more suitable for applications that require high I/O request rates and are rela- 
tively less suited for applications that require high data transfer rates. 

As in the other RAID schemes, data striping is used. In the case of RAID 4 
through 6, the strips are relatively large. With RAID 4, a bit-by-bit parity strip is cal- 
culated across corresponding strips on each data disk, and the parity bits are stored 
in the corresponding strip on the parity disk. 

RAID 4 involves a write penalty when an I/O write request of small size is per- 
formed. Each time that a write occurs, the array management software must update 
not only the user data but also the corresponding parity bits. Consider an array of 
five drives in which XO through X3 contain data and X4 is the parity disk. Suppose 
that a write is performed that only involves a strip on disk X1. Initially, for each bit 
i, we have the following relationship: 


X4(i) = X3(i) ® X2(i) 6 X1() @ XO(A) (6.1) 
After the update, with potentially altered bits indicated by a prime symbol: 
X4'(i) = X3(i) ® X2(i) ® X1'(1)  XO(A) 
= X3(i) ® X2(i) ® X1'(4) S XO)  X1 (4) @ X14) 
= X3(i) ® X2(i) G X1(4) B X0G) @ X14) G X17) 
= X4(i) ® X1(i) ® X1'(a) 

The preceding set of equations is derived as follows. The first line shows that a 
change in X1 will also affect the parity disk X4. In the second line, we add the terms ® 
X1(2) ® X1(i)]. Because the exclusive-OR of any quantity with itself is 0, this does 
not affect the equation. However, it is a convenience that is used to create the third 


line, by reordering. Finally, Equation (6.1) is used to replace the first four terms by 
X4(i). 
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To calculate the new parity, the array management software must read the old 
user strip and the old parity strip. Then it can update these two strips with the new 
data and the newly calculated parity. Thus, each strip write involves two reads and 
two writes. 

In the case of a larger size I/O write that involves strips on all disk drives, 
parity is easily computed by calculation using only the new data bits. Thus, the par- 
ity drive can be updated in parallel with the data drives and there are no extra 
reads or writes. 

In any case, every write operation must involve the parity disk, which there- 
fore can become a bottleneck. 


RAID Level 5 


RAID 5 is organized in a similar fashion to RAID 4. The difference is that RAID 5 
distributes the parity strips across all disks. A typical allocation is a round-robin 
scheme, as illustrated in Figure 6.8f. For an n-disk array, the parity strip is on a differ- 
ent disk for the first n stripes, and the pattern then repeats. 

The distribution of parity strips across all drives avoids the potential I/O bottle- 
neck found in RAID 4. 


RAID Level 6 


RAID 6 was introduced in a subsequent paper by the Berkeley researchers 
[KATZ89]. In the RAID 6 scheme, two different parity calculations are carried out 
and stored in separate blocks on different disks. Thus, a RAID 6 array whose user 
data require N disks consists of N + 2 disks. 

Figure 6.8¢ illustrates the scheme. P and Q are two different data check algo- 
rithms. One of the two is the exclusive-OR calculation used in RAID 4 and 5. But 
the other is an independent data check algorithm. This makes it possible to regener- 
ate data even if two disks containing user data fail. 

The advantage of RAID 6 is that it provides extremely high data availability. 
Three disks would have to fail within the MTTR (mean time to repair) interval to 
cause data to be lost. On the other hand, RAID 6 incurs a substantial write penalty, 
because each write affects two parity blocks. Performance benchmarks [EISC07] 
show a RAID 6 controller can suffer more than a 30% drop in overall write perfor- 
mance compared with a RAID 5 implementation. RAID 5 and RAID 6 read per- 
formance is comparable. 

Table 6.4 is a comparative summary of the seven levels. 
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In 1983, one of the most successful consumer products of all time was introduced: 
the compact disk (CD) digital audio system. The CD is a nonerasable disk that can 
store more than 60 minutes of audio information on one side. The huge commercial 
success of the CD enabled the development of low-cost optical-disk storage tech- 
nology that has revolutionized computer data storage. A variety of optical-disk 
systems have been introduced (Table 6.5). We briefly review each of these. 


RAID Comparison 


Advantages 


Disadvantages 


Applications 


I/O performance is greatly 
improved by spreading the I/O 
load across many channels and 
drives 


No parity calculation overhead is 
involved 


Very simple design 
Easy to implement 


100% redundancy of data means 
no rebuild is necessary in case of a 
disk failure, just a copy to the 
replacement disk 


Under certain circumstances, 
RAID 1 can sustain multiple 
simultaneous drive failures 


Simplest RAID storage subsystem 
design 


The failure of just one 
drive will result in all 
data in an array being 
lost 


Highest disk overhead 
of all RAID types 
(100% )— inefficient 


Video production and 
editing 

Image Editing 
Pre-press applications 


Any application requiring 
high bandwidth 


Accounting 
Payroll 
Financial 


Any application requiring 
very high availability 


Extremely high data transfer rates 
possible 


The higher the data transfer rate 
required, the better the ratio of 
data disks to ECC disks 


Relatively simple controller design 
compared to RAID levels 3,4 & 5 


Very high ratio of ECC 
disks to data disks 

with smaller 

word sizes — 

inefficient 


Entry level cost very 
high—requires very 
high transfer rate 
requirement to justify 


No commercial 
implementations exist/ 
not commercially viable 


Very high read data transfer rate 
Very high write data transfer rate 


Disk failure has an insignificant 
impact on throughput 


Low ratio of ECC (parity) disks to 
data disks means high efficiency 


Transaction rate equal 
to that of a single disk 
drive at best (if 
spindles are 
synchronized) 


Controller design is 
fairly complex 


Video production and live 
streaming 


Image editing 

Video editing 

Prepress applications 
Any application requiring 
high throughput 


Very high Read data transaction rate 


Low ratio of ECC (parity) disks to 
data disks means high efficiency 


Quite complex 
controller design 


Worst write transaction 
rate and Write aggregate 
transfer rate 


Difficult and inefficient 
data rebuild in the event 
of disk failure 


No commercial 
implementations exist/ 
not commercially viable 


(Continued) 


Table 6.4 Continued 


Advantages 


Highest Read data transaction rate 


Low ratio of ECC (parity) disks to 
data disks means high efficiency 


Good aggregate transfer rate 
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Disadvantages 


Most complex 
controller design 


Difficult to rebuild in 
the event of a disk 
failure (as compared 
to RAID level 1) 


Applications 
File and application 
servers 
Database servers 


Web, e-mail, and 
news servers 


Intranet servers 
Most versatile RAID level 


Provides for an extremely high data 
fault tolerance and can sustain multi- 
ple simultaneous drive failures 


Table 6.5 Optical Disk Products 


CD 


More complex 
controller design 


Controller overhead 
to compute parity 
addresses is extremely 
high 


Perfect solution for 
mission critical 
applications 


Compact Disk. A nonerasable disk that stores digitized audio information. The standard system uses 
12-cm disks and can record more than 60 minutes of uninterrupted playing time. 


CD-ROM 


Compact Disk Read-Only Memory. A nonerasable disk used for storing computer data. The standard 
system uses 12-cm disks and can hold more than 650 Mbytes. 


CD-R 


CD Recordable. Similar to a CD-ROM. The user can write to the disk only once. 


CD-RW 


CD Rewritable. Similar to a CD-ROM. The user can erase and rewrite to the disk multiple times. 


DVD 


Digital Versatile Disk. A technology for producing digitized, compressed representation of video infor- 
mation, as well as large volumes of other digital data. Both 8 and 12 cm diameters are used, with a 
double-sided capacity of up to 17 Gbytes. The basic DVD is read-only (DVD-ROM). 


DVD-R 


DVD Recordable. Similar toa DVD-ROM. The user can write to the disk only once. Only one-sided 


disks can be used. 


DVD-RW 


DVD Rewritable. Similar to a DVD-ROM. The user can erase and rewrite to the disk multiple times. 


Only one-sided disks can be used. 


Blu-Ray DVD 


High definition video disk. Provides considerably greater data storage density than DVD, using a 405-nm 
(blue-violet) laser. A single layer on a single side can store 25 Gbytes. 
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Figure 6.10 CD Operation 


Compact Disk 


CD-ROM Both the audio CD and the CD-ROM (compact disk read-only memory) 
share a similar technology. The main difference is that CD-ROM players are more 
rugged and have error correction devices to ensure that data are properly transferred 
from disk to computer. Both types of disk are made the same way. The disk is formed 
from a resin, such as polycarbonate. Digitally recorded information (either music or 
computer data) is imprinted as a series of microscopic pits on the surface of the poly- 
carbonate. This is done, first of all, with a finely focused, high-intensity laser to create a 
master disk. The master is used, in turn, to make a die to stamp out copies onto poly- 
carbonate. The pitted surface is then coated with a highly reflective surface, usually 
aluminum or gold. This shiny surface is protected against dust and scratches by a top 
coat of clear acrylic. Finally, a label can be silkscreened onto the acrylic. 

Information is retrieved from a CD or CD-ROM by a low-powered laser 
housed in an optical-disk player, or drive unit. The laser shines through the clear 
polycarbonate while a motor spins the disk past it (Figure 6.10). The intensity of the 
reflected light of the laser changes as it encounters a pit. Specifically, if the laser 
beam falls on a pit, which has a somewhat rough surface, the light scatters and a low 
intensity is reflected back to the source. The areas between pits are called lands. A 
land is a smooth surface, which reflects back at higher intensity. The change between 
pits and lands is detected by a photosensor and converted into a digital signal. The 
sensor tests the surface at regular intervals. The beginning or end of a pit represents 
a 1; when no change in elevation occurs between intervals, a 0 is recorded. 

Recall that on a magnetic disk, information is recorded in concentric tracks. 
With the simplest constant angular velocity (CAV) system, the number of bits per 
track is constant. An increase in density is achieved with multiple zoned recording, 
in which the surface is divided into a number of zones, with zones farther from the 
center containing more bits than zones closer to the center. Although this technique 
increases capacity, it is still not optimal. 

To achieve greater capacity, CDs and CD-ROMs do not organize information 
on concentric tracks. Instead, the disk contains a single spiral track, beginning near 
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Figure 6.11 CD-ROM Block Format 


the center and spiraling out to the outer edge of the disk. Sectors near the outside of 
the disk are the same length as those near the inside. Thus, information is packed 
evenly across the disk in segments of the same size and these are scanned at the 
same rate by rotating the disk at a variable speed. The pits are then read by the laser 
at a constant linear velocity (CLV). The disk rotates more slowly for accesses near 
the outer edge than for those near the center. Thus, the capacity of a track and the 
rotational delay both increase for positions nearer the outer edge of the disk. The 
data capacity for a CD-ROM is about 680 MB. 

Data on the CD-ROM are organized as a sequence of blocks. A typical block 
format is shown in Figure 6.11. It consists of the following fields: 


e Syne: The sync field identifies the beginning of a block. It consists of a byte of 
all Os, 10 bytes of all 1s, and a byte of all Os. 

Header: The header contains the block address and the mode byte. Mode 0 
specifies a blank data field; mode 1 specifies the use of an error-correcting 
code and 2048 bytes of data; mode 2 specifies 2336 bytes of user data with no 
error-correcting code. 

Data: User data. 


Auxiliary: Additional user data in mode 2. In mode 1, this is a 288-byte error- 
correcting code. 


With the use of CLV, random access becomes more difficult. Locating a spe- 
cific address involves moving the head to the general area, adjusting the rotation 
speed and reading the address, and then making minor adjustments to find and ac- 
cess the specific sector. 

CD-ROM is appropriate for the distribution of large amounts of data to a 
large number of users. Because of the expense of the initial writing process, it is not 
appropriate for individualized applications. Compared with traditional magnetic 
disks, the CD-ROM has two advantages: 


e The optical disk together with the information stored on it can be mass 
replicated inexpensively—unlike a magnetic disk. The database on a mag- 
netic disk has to be reproduced by copying one disk at a time using two 
disk drives. 
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e The optical disk is removable, allowing the disk itself to be used for archival 
storage. Most magnetic disks are nonremovable. The information on nonre- 
movable magnetic disks must first be copied to another storage medium be- 
fore the disk drive/disk can be used to store new information. 


The disadvantages of CD-ROM are as follows: 


e Itis read-only and cannot be updated. 


e It has an access time much longer than that of a magnetic disk drive, as much 
as half a second. 


CD RECORDABLE To accommodate applications in which only one or a small 
number of copies of a set of data is needed, the write-once read-many CD, known as 
the CD recordable (CD-R), has been developed. For CD-R, a disk is prepared in 
such a way that it can be subsequently written once with a laser beam of modest 
intensity. Thus, with a somewhat more expensive disk controller than for CD-ROM, 
the customer can write once as well as read the disk. 

The CD-R medium is similar to but not identical to that of a CD or CD-ROM. 
For CDs and CD-ROMs, information is recorded by the pitting of the surface of the 
medium, which changes reflectivity. For a CD-R, the medium includes a dye layer. 
The dye is used to change reflectivity and is activated by a high-intensity laser. The 
resulting disk can be read on a CD-R drive or a CD-ROM drive. 

The CD-R optical disk is attractive for archival storage of documents and files. 
It provides a permanent record of large volumes of user data. 


CD REWRITABLE The CD-RW optical disk can be repeatedly written and overwrit- 
ten, as with a magnetic disk. Although a number of approaches have been tried, the 
only pure optical approach that has proved attractive is called phase change. The 
phase change disk uses a material that has two significantly different reflectivities in 
two different phase states. There is an amorphous state, in which the molecules ex- 
hibit a random orientation that reflects light poorly; and a crystalline state, which has 
a smooth surface that reflects light well. A beam of laser light can change the mater- 
ial from one phase to the other. The primary disadvantage of phase change optical 
disks is that the material eventually and permanently loses its desirable properties. 
Current materials can be used for between 500,000 and 1,000,000 erase cycles. 

The CD-RW has the obvious advantage over CD-ROM and CD-R that it can 
be rewritten and thus used as a true secondary storage. As such, it competes with 
magnetic disk. A key advantage of the optical disk is that the engineering tolerances 
for optical disks are much less severe than for high-capacity magnetic disks. Thus, 
they exhibit higher reliability and longer life. 


Digital Versatile Disk 


With the capacious digital versatile disk (DVD), the electronics industry has at last 
found an acceptable replacement for the analog VHS video tape. The DVD has re- 
placed the videotape used in video cassette recorders (VCRs) and, more important 
for this discussion, replace the CD-ROM in personal computers and servers. The 
DVD takes video into the digital age. It delivers movies with impressive picture qual- 
ity, and it can be randomly accessed like audio CDs, which DVD machines can also 
play. Vast volumes of data can be crammed onto the disk, currently seven times as 
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(b) DVD-ROM, double-sided, dual-layer-Capacity 17 GB 


Figure 6.12 CD-ROM and DVD-ROM 


much as a CD-ROM. With DVD’s huge storage capacity and vivid quality, PC games 
have become more realistic and educational software incorporates more video. Fol- 
lowing in the wake of these developments has been a new crest of traffic over the In- 
ternet and corporate intranets, as this material is incorporated into Web sites. 


The DVD’s greater capacity is due to three differences from CDs (Figure 6.12): 


. Bits are packed more closely on a DVD. The spacing between loops of a spiral 


on a CD is 1.6 wm and the minimum distance between pits along the spiral is 
0.834 wm.The DVD uses a laser with shorter wavelength and achieves a loop spac- 
ing of 0.74 wm and a minimum distance between pits of 0.4 wm. The result of these 
two improvements is about a seven-fold increase in capacity, to about 4.7 GB. 


. The DVD employs a second layer of pits and lands on top of the first layer. A dual- 


layer DVD has a semireflective layer on top of the reflective layer, and by adjust- 
ing focus, the lasers in DVD drives can read each layer separately. This technique 
almost doubles the capacity of the disk, to about 8.5 GB. The lower reflectivity of 
the second layer limits its storage capacity so that a full doubling is not achieved. 


. The DVD-ROM can be two sided, whereas data are recorded on only one side 


of a CD. This brings total capacity up to 17 GB. 
As with the CD, DVDs come in writeable as well as read-only versions (Table 6.5). 


210 CHAPTER 6 / EXTERNAL MEMORY 


cD 2.11 um 
7 eo a Data layer 
Beam spot + Land 
= of 4 ——_ 
Pit 1.2 um 
_ a a © > € 
ot 
Pare ee reed 
ey emer ee  yeee  e| Laser wavelength 
= 780 nm 
| 
1.32 wm 
ame ae om a» « 
CT ee ed eon ow oF 4 
am Ome = |e 0 = - 8 | 405 nm 
aa oe a or oe ee oe 
ome a 6 FF 4 fe ee - «4 
CO ee 
650 nm 


Figure 6.13 Optical Memory Characteristics 


High-Definition Optical Disks 


High-definition optical disks are designed to store high-definition videos and to 
provide significantly greater storage capacity compared to DVDs. The higher bit 
density is achieved by using a laser with a shorter wavelength, in the blue-violet 
range. The data pits, which constitute the digital 1s and 0s, are smaller on the high- 
definition optical disks compared to DVD because of the shorter laser wavelength. 

Two competing disk formats and technologies initially competed for market ac- 
ceptance: HD DVD and Blu-ray DVD. The Blu-ray scheme ultimately achieved market 
dominance. The HD DVD scheme can store 15 GB on a single layer on a single side. 
Blu-ray positions the data layer on the disk closer to the laser (shown on the right-hand 
side of each diagram in Figure 6.13). This enables a tighter focus and less distortion and 
thus smaller pits and tracks. Blu-ray can store 25 GB on a single layer. Three versions are 
available: read only (BD-ROM), recordable once (BD-R), and rerecordable (BD-RE). 


6.4 MAGNETIC TAPE 


Tape systems use the same reading and recording techniques as disk systems. The 
medium is flexible polyester (similar to that used in some clothing) tape coated with 
magnetizable material. The coating may consist of particles of pure metal in special 
binders or vapor-plated metal films. The tape and the tape drive are analogous to a 
home tape recorder system. Tape widths vary from 0.38 cm (0.15 inch) to 1.27 cm 
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(0.5 inch). Tapes used to be packaged as open reels that have to be threaded through 
a second spindle for use. Today, virtually all tapes are housed in cartridges. 

Data on the tape are structured as a number of parallel tracks running 
lengthwise. Earlier tape systems typically used nine tracks. This made it possible to 
store data one byte at a time, with an additional parity bit as the ninth track. This 
was followed by tape systems using 18 or 36 tracks, corresponding to a digital word 
or double word. The recording of data in this form is referred to as parallel record- 
ing. Most modern systems instead use serial recording, in which data are laid out 
as a sequence of bits along each track, as is done with magnetic disks. As with the 
disk, data are read and written in contiguous blocks, called physical records, on a 
tape. Blocks on the tape are separated by gaps referred to as interrecord gaps. As 
with the disk, the tape is formatted to assist in locating physical records. 

The typical recording technique used in serial tapes is referred to as 
serpentine recording. In this technique, when data are being recorded, the first set 
of bits is recorded along the whole length of the tape. When the end of the tape is 
reached, the heads are repositioned to record a new track, and the tape is again 
recorded on its whole length, this time in the opposite direction. That process con- 
tinues, back and forth, until the tape is full (Figure 6.14a). To increase speed, the 
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Figure 6.14 Typical Magnetic Tape Features 
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Table 6.6 LTO Tape Drives 


LTO-4 


Release date 2007 


Compressed 1600 GB 
capacity 


Compressed 
transfer rate 
(MB/s) 


Linear density 
(bits/mm) 


Tape tracks 


Tape length 


Tape width (cm) 


Write elements 


read-write head is capable of reading and writing a number of adjacent tracks 
simultaneously (typically two to eight tracks). Data are still recorded serially along 
individual tracks, but blocks in sequence are stored on adjacent tracks, as suggested 
by Figure 6.14b. 

A tape drive is a sequential-access device. If the tape head is positioned at 
record 1, then to read record N, it is necessary to read physical records 1 through N — 1, 
one at a time. If the head is currently positioned beyond the desired record, it is nec- 
essary to rewind the tape a certain distance and begin reading forward. Unlike the 
disk, the tape is in motion only during a read or write operation. 

In contrast to the tape, the disk drive is referred to as a direct-access device. A 
disk drive need not read all the sectors on a disk sequentially to get to the desired 
one. It must only wait for the intervening sectors within one track and can make suc- 
cessive accesses to any track. 

Magnetic tape was the first kind of secondary memory. It is still widely used as 
the lowest-cost, slowest-speed member of the memory hierarchy. 

The dominant tape technology today is a cartridge system known as linear 
tape-open (LTO). LTO was developed in the late 1990s as an open-source alterna- 
tive to the various proprietary systems on the market. Table 6.6 shows parameters 
for the various LTO generations. See Appendix J for details. 


6.5 RECOMMENDED READING AND WEB SITES 


[JACO08] provides solid coverage of magnetic disks. [|MEE96a] provides a good survey of the 
underlying recording technology of disk and tape systems. [MEE96b] focuses on the data 
storage techniques for disk and tape systems. [COME00] is a short but instructive article on 
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current trends in magnetic disk storage technology. [RADD08] and [ANDE03] provide a 
more recent discussion of magnetic disk storage technology. 

An excellent survey of RAID technology, written by the inventors of the RAID con- 
cept, is [CHEN94]. A good overview paper is [FRIE96]. A good performance comparison of 
the RAID architectures is [CHEN96]. 

[MARC90] gives an excellent overview of the optical storage field. A good survey of 
the underlying recording and reading technology is [MANS97]. 

[ROSC03] provides a comprehensive overview of all types of external memory sys- 
tems, with a modest amount of technical detail on each. [KHURO1] is another good survey. 

[HAEU07] provides a detailed treatment of LTO. 
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6.6 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


access time 
Blu-ray 
CD 
CD-ROM 
CD-R 
CD-RW 


constant angular velocity 
(CAV) 
constant linear velocity (CLV) 


DVD-RW 
fixed-head disk 
floppy disk 

gap 

head 

land 

magnetic disk 
magnetic tape 
magnetoresistive 


pit 

platter 

RAID 

removable disk 
rotational delay 
sector 

seek time 

serpentine recording 
striped data 


cylinder 
DVD 


DVD-ROM 
DVD-R 


movable-head disk 
multiple zoned recording 
nonremovable disk 
optical memory 


substrate 
track 
transfer time 


Review Questions 


6.1 What are the advantages of using a glass substrate for a magnetic disk? 
6.2 How are data written onto a magnetic disk? 
6.3 How are data read from a magnetic disk? 
6.4 Explain the difference between a simple CAV system and a multiple zoned recording 
system. 
6.5 Define the terms track, cylinder, and sector. 
6.6 What is the typical disk sector size? 
6.7 Define the terms seek time, rotational delay, access time, and transfer time. 
6.8 What common characteristics are shared by all RAID levels? 
6.9 Briefly define the seven RAID levels. 
6.10 Explain the term striped data. 
6.11 How is redundancy achieved in a RAID system? 
6.12 In the context of RAID, what is the distinction between parallel access and indepen- 
dent access? 
6.13 What is the difference between CAV and CLV? 
6.14 What differences between a CD anda DVD account for the larger capacity of the latter? 
6.15 Explain serpentine recording. 
Problems 
6.1 Consider a disk with N tracks numbered from 0 to (N — 1) and assume that re- 


quested sectors are distributed randomly and evenly over the disk. We want to calcu- 

late the average number of tracks traversed by a seek. 

a. First, calculate the probability of a seek of length j when the head is currently po- 
sitioned over track t. Hint: This is a matter of determining the total number of 
combinations, recognizing that all track positions for the destination of the seek 
are equally likely. 


6.2 


6.4 


6.6 
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b. Next, calculate the probability of a seek of length K. Hint: this involves the sum- 
ming over all possible combinations of movements of K tracks. 

Calculate the average number of tracks traversed by a seek, using the formula for 
expected value 


£ 


E[x] = Six Priv =]] 


i=0 


n +e + 1)Qn +1 
Hint: Use the equalities: )i = me ) dS? = ie x = ) 
Ai 


i=1 


d. Show that for large values of N, the average number of tracks traversed by a seek 
approaches N/3. 


Define the following for a disk system: 
t, = seek time; average time to position head over track 
r = rotation speed of the disk, in revolutions per second 
n = number of bits per sector 
N = capacity of a track, in bits 
t, = time to access a sector 


Develop a formula for ¢, as a function of the other parameters. 


Consider a magnetic disk drive with 8 surfaces, 512 tracks per surface, and 64 sectors 

per track. Sector size is 1 KB. The average seek time is 8 ms, the track-to-track access 

time is 1.5 ms, and the drive rotates at 3600 rpm. Successive tracks in a cylinder can be 

read without head movement. 

a. What is the disk capacity? 

b. What is the average access time? Assume this file is stored in successive sectors 
and tracks of successive cylinders, starting at sector 0, track 0, of cylinder i. 

c. Estimate the time required to transfer a 5-MB file. 

d. What is the burst transfer rate? 


Consider a single-platter disk with the following parameters: rotation speed: 
7200 rpm; number of tracks on one side of platter: 30,000; number of sectors per 
track: 600; seek time: one ms for every hundred tracks traversed. Let the disk receive 
a request to access a random sector on a random track and assume the disk head 
starts at track 0. 

a. What is the average seek time? 

b. What is the average rotational latency? 

c. What is the transfer time for a sector? 

d. What is the total average time to satisfy a request? 


A distinction is made between physical records and logical records. A logical record is 
a collection of related data elements treated as a conceptual unit, independent of how 
or where the information is stored. A physical record is a contiguous area of storage 
space that is defined by the characteristics of the storage device and operating system. 
Assume a disk system in which each physical record contains thirty 120-byte logical 
records. Calculate how much disk space (in sectors, tracks, and surfaces) will be re- 
quired to store 300,000 logical records if the disk is fixed-sector with 512 bytes/sector, 
with 96 sectors/track, 110 tracks per surface, and 8 usable surfaces. Ignore any file 
header record(s) and track indexes, and assume that records cannot span two sectors. 


Consider a disk that rotates at 3600 rpm. The seek time to move the head between ad- 
jacent tracks is 2 ms. There are 32 sectors per track, which are stored in linear order 
from sector 0 through sector 31. The head sees the sectors in ascending order. Assume 
the read/write head is positioned at the start of sector 1 on track 8. There is a main 
memory buffer large enough to hold an entire track. Data is transferred between disk 
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6.7 


6.8 


6.9 


6.10 


locations by reading from the source track into the main memory buffer and then 

writing the date from the buffer to the target track. 

a. How long will it take to transfer sector 1 on track 8 to sector 1 on track 9? 

b. How long will it take to transfer all the sectors of track 8 to the corresponding sec- 
tors of track 9? 


It should be clear that disk striping can improve data transfer rate when the strip size 
is small compared to the I/O request size. It should also be clear that RAID 0 pro- 
vides improved performance relative to a single large disk, because multiple I/O re- 
quests can be handled in parallel. However, in this latter case, is disk striping 
necessary? That is, does disk striping improve I/O request rate performance com- 
pared to a comparable disk array without striping? 


Consider a 4-drive, 200GB-per-drive RAID array. What is the available data storage 
capacity for each of the RAID levels, 0, 1,3, 4,5, and 6? 


For a compact disk, audio is converted to digital with 16-bit samples, and is treated a 
stream of 8-bit bytes for storage. One simple scheme for storing this data, called direct 
recording, would be to represent a 1 by a land and a 0 by a pit. Instead, each byte is 
expanded into a 14-bit binary number. It turns out that exactly 256 (2°) of the total of 
16,134 (2!) 14-bit numbers have at least two 0s between every pair of 1s, and these 
are the numbers selected for the expansion from 8 to 14 bits. The optical system de- 
tects the presence of 1s by detecting a transition for pit to land or land to pit. It detects 
Os by measuring the distances between intensity changes. This scheme requires that 
there are no 1s in succession; hence the use of the 8-to-14 code. 

The advantage of this scheme is as follows. For a given laser beam diameter, 
there is a minimum-pit size, regardless of how the bits are represented. With this 
scheme, this minimum-pit size stores 3 bits, because at least two Os follow every 1. 
With direct recording, the same pit would be able to store only one bit. Considering 
both the number of bits stored per pit and the 8-to-14 bit expansion, which scheme 
stores the most bits and by what factor? 


Design a backup strategy for a computer system. One option is to use plug-in external 

disks, which cost $150 for each 500 GB drive. Another option is to buy a tape drive for 

$2500, and 400 GB tapes for $50 apiece. (These were realistic prices in 2008.) A typi- 

cal backup strategy is to have two sets of backup media onsite, with backups alter- 

nately written on them so in case the system fails while making a backup, the previous 

version is still intact. There’s also a third set kept offsite, with the offsite set periodi- 

cally swapped with an on-site set. 

a. Assume you have 1 TB (1000 GB) of data to back up. How much would a disk 
backup system cost? 

b. How much would a tape backup system cost for 1 TB? 

c. How large would each backup have to be in order for a tape strategy to be less 
expensive? 

d. What kind of backup strategy favors tapes? 
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KEY POINTS 


The computer system’s I/O architecture is its interface to the outside world. 
This architecture provides a systematic means of controlling interaction 
with the outside world and provides the operating system with the informa- 
tion it needs to manage I/O activity effectively. 


The are three principal I/O techniques: programmed I/O, in which I/O oc- 
curs under the direct and continuous control of the program requesting the 
1/O operation; interrupt-driven I/O, in which a program issues an I/O com- 
mand and then continues to execute, until it is interrupted by the I/O hard- 
ware to signal the end of the I/O operation; and direct memory access 
(DMA), in which a specialized I/O processor takes over control of an I/O 
operation to move a large block of data. 


Two important examples of external I/O interfaces are FireWire and 
Infiniband. 


In addition to the processor and a set of memory modules, the third key element of a 
computer system is a set of I/O modules. Each module interfaces to the system bus or 
central switch and controls one or more peripheral devices. An I/O module is not sim- 
ply a set of mechanical connectors that wire a device into the system bus. Rather, the 
1/O module contains logic for performing a communication function between the pe- 


I/O System Design Tool 


ripheral and the bus. 


The reader may wonder why one does not connect peripherals directly to the sys- 


tem bus. The reasons are as follows: 


There are a wide variety of peripherals with various methods of operation. It 
would be impractical to incorporate the necessary logic within the processor 


to control a range of devices. 


The data transfer rate of peripherals is often much slower than that of the 
memory or processor. Thus, it is impractical to use the high-speed system bus 


to communicate directly with a peripheral. 


On the other hand, the data transfer rate of some peripherals is faster than 
that of the memory or processor. Again, the mismatch would lead to ineffi- 


ciencies if not managed properly. 


Peripherals often use different data formats and word lengths than the com- 


puter to which they are attached. 
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Figure 7.1 Generic Model of an I/O Module 


Thus, an I/O module is required. This module has two major functions 
(Figure 7.1): 


e Interface to the processor and memory via the system bus or central switch 
e Interface to one or more peripheral devices by tailored data links 


We begin this chapter with a brief discussion of external devices, followed by an 
overview of the structure and function of an I/O module. Then we look at the various 
ways in which the I/O function can be performed in cooperation with the processor and 
memory: the internal I/O interface. Finally, we examine the external I/O interface, 
between the I/O module and the outside world. 


7.1. EXTERNAL DEVICES 


I/O operations are accomplished through a wide assortment of external devices that 
provide a means of exchanging data between the external environment and the 
computer. An external device attaches to the computer by a link to an I/O module 
(Figure 7.1). The link is used to exchange control, status, and data between the I/O 
module and the external device. An external device connected to an I/O module is 
often referred to as a peripheral device or, simply, a peripheral. 
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We can broadly classify external devices into three categories: 


e Human readable: Suitable for communicating with the computer user 
e Machine readable: Suitable for communicating with equipment 
e Communication: Suitable for communicating with remote devices 


Examples of human-readable devices are video display terminals (VDTs) and 
printers. Examples of machine-readable devices are magnetic disk and tape systems, 
and sensors and actuators, such as are used in a robotics application. Note that we 
are viewing disk and tape systems as I/O devices in this chapter, whereas in Chapter 6 
we viewed them as memory devices. From a functional point of view, these devices 
are part of the memory hierarchy, and their use is appropriately discussed in 
Chapter 6. From a structural point of view, these devices are controlled by I/O mod- 
ules and are hence to be considered in this chapter. 

Communication devices allow a computer to exchange data with a remote de- 
vice, which may be a human-readable device, such as a terminal, a machine-readable 
device, or even another computer. 

In very general terms, the nature of an external device is indicated in Figure 7.2. 
The interface to the I/O module is in the form of control, data, and status signals. 
Control signals determine the function that the device will perform, such as send 
data to the I/O module (INPUT or READ), accept data from the I/O module 
(OUTPUT or WRITE), report status, or perform some control function particular 
to the device (e.g., position a disk head). Data are in the form of a set of bits to be 
sent to or received from the I/O module. Status signals indicate the state of the de- 
vice. Examples are READY/NOT-READY to show whether the device is ready for 
data transfer. 


Control Status Data bits 
signals from signals to to and from 
1V/O module V/O module V/O module 
Control Buffer 
logic 
Transducer 


Data (device-unique) 
to and from 
environment 


Figure 7.2 Block Diagram of an External Device 
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Control logic associated with the device controls the device’s operation in re- 
sponse to direction from the I/O module. The transducer converts data from electri- 
cal to other forms of energy during output and from other forms to electrical during 
input. Typically, a buffer is associated with the transducer to temporarily hold data 
being transferred between the I/O module and the external environment; a buffer 
size of 8 to 16 bits is common. 

The interface between the I/O module and the external device will be exam- 
ined in Section 7.7. The interface between the external device and the environment 
is beyond the scope of this book, but several brief examples are given here. 


Keyboard/Monitor 


The most common means of computer/user interaction is a keyboard/monitor 
arrangement. The user provides input through the keyboard. This input is then 
transmitted to the computer and may also be displayed on the monitor. In addition, 
the monitor displays data provided by the computer. 

The basic unit of exchange is the character. Associated with each character is a 
code, typically 7 or 8 bits in length. The most commonly used text code is the Inter- 
national Reference Alphabet (IRA).! Each character in this code is represented by 
a unique 7-bit binary code; thus, 128 different characters can be represented. Char- 
acters are of two types: printable and control. Printable characters are the alpha- 
betic, numeric, and special characters that can be printed on paper or displayed on a 
screen. Some of the control characters have to do with controlling the printing or 
displaying of characters; an example is carriage return. Other control characters are 
concerned with communications procedures. See Appendix F for details. 

For keyboard input, when the user depresses a key, this generates an electronic 
signal that is interpreted by the transducer in the keyboard and translated into the 
bit pattern of the corresponding IRA code. This bit pattern is then transmitted to 
the I/O module in the computer. At the computer, the text can be stored in the same 
IRA code. On output, IRA code characters are transmitted to an external device 
from the I/O module. The transducer at the device interprets this code and sends the 
required electronic signals to the output device either to display the indicated char- 
acter or perform the requested control function. 


Disk Drive 


A disk drive contains electronics for exchanging data, control, and status signals 
with an I/O module plus the electronics for controlling the disk read/write mecha- 
nism. In a fixed-head disk, the transducer is capable of converting between the mag- 
netic patterns on the moving disk surface and bits in the device’s buffer (Figure 7.2). 
A moving-head disk must also be able to cause the disk arm to move radially in and 
out across the disk’s surface. 


'IRA is defined in ITU-T Recommendation T.50 and was formerly known as International Alphabet 
Number 5 (IA5). The U.S. national version of IRA is referred to as the American Standard Code for 
Information Interchange (ASCII). 
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Module Function 


The major functions or requirements for an I/O module fall into the following 
categories: 


e Control and timing 

e Processor communication 
e Device communication 

e Data buffering 


e Error detection 


During any period of time, the processor may communicate with one or more 
external devices in unpredictable patterns, depending on the program’s need for I/O. 
The internal resources, such as main memory and the system bus, must be shared 
among a number of activities, including data I/O. Thus, the I/O function includes a 
control and timing requirement, to coordinate the flow of traffic between internal re- 
sources and external devices. For example, the control of the transfer of data from an 
external device to the processor might involve the following sequence of steps: 


1. The processor interrogates the I/O module to check the status of the attached 
device. 


2. The I/O module returns the device status. 


3. If the device is operational and ready to transmit, the processor requests the 
transfer of data, by means of a command to the I/O module. 


4. The I/O module obtains a unit of data (e.g.,8 or 16 bits) from the external device. 
5. The data are transferred from the I/O module to the processor. 


If the system employs a bus, then each of the interactions between the proces- 
sor and the I/O module involves one or more bus arbitrations. 

The preceding simplified scenario also illustrates that the I/O module must 
communicate with the processor and with the external device. Processor communi- 
cation involves the following: 


e Command decoding: The I/O module accepts commands from the processor, 
typically sent as signals on the control bus. For example, an I/O module for a 
disk drive might accept the following commands: READ SECTOR, WRITE 
SECTOR, SEEK track number, and SCAN record ID. The latter two com- 
mands each include a parameter that is sent on the data bus. 


Data: Data are exchanged between the processor and the I/O module over the 
data bus. 


Status reporting: Because peripherals are so slow, it is important to know the 
status of the I/O module. For example, if an I/O module is asked to send data 
to the processor (read), it may not be ready to do so because it is still working 
on the previous I/O command. This fact can be reported with a status signal. 
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Common status signals are BUSY and READY. There may also be signals to 
report various error conditions. 


e Address recognition: Just as each word of memory has an address, so does 
each I/O device. Thus, an I/O module must recognize one unique address for 
each peripheral it controls. 


On the other side, the I/O module must be able to perform device communication. 
This communication involves commands, status information, and data (Figure 7.2). 

An essential task of an I/O module is data buffering. The need for this function 
is apparent from Figure 2.11. Whereas the transfer rate into and out of main mem- 
ory or the processor is quite high, the rate is orders of magnitude lower for many pe- 
ripheral devices and covers a wide range. Data coming from main memory are sent 
to an I/O module in a rapid burst. The data are buffered in the I/O module and then 
sent to the peripheral device at its data rate. In the opposite direction, data are 
buffered so as not to tie up the memory in a slow transfer operation. Thus, the I/O 
module must be able to operate at both device and memory speeds. Similarly, if the 
I/O device operates at a rate higher than the memory access rate, then the I/O mod- 
ule performs the needed buffering operation. 

Finally, an I/O module is often responsible for error detection and for subse- 
quently reporting errors to the processor. One class of errors includes mechanical and 
electrical malfunctions reported by the device (e.g., paper jam, bad disk track). Another 
class consists of unintentional changes to the bit pattern as it is transmitted from device 
to I/O module. Some form of error-detecting code is often used to detect transmission 
errors. A simple example is the use of a parity bit on each character of data. For exam- 
ple, the IRA character code occupies 7 bits of a byte. The eighth bit is set so that the total 
number of 1s in the byte is even (even parity) or odd (odd parity). When a byte is re- 
ceived, the I/O module checks the parity to determine whether an error has occurred. 


I/O Module Structure 


I/O modules vary considerably in complexity and the number of external devices 
that they control. We will attempt only a very general description here. (One specific 
device, the Intel 82C55A, is described in Section 7.4.) Figure 7.3 provides a general 
block diagram of an I/O module. The module connects to the rest of the computer 
through a set of signal lines (e.g., system bus lines). Data transferred to and from the 
module are buffered in one or more data registers. There may also be one or more 
status registers that provide current status information. A status register may also 
function as a control register, to accept detailed control information from the 
processor. The logic within the module interacts with the processor via a set of con- 
trol lines. The processor uses the control lines to issue commands to the I/O module. 
Some of the control lines may be used by the I/O module (e.g., for arbitration and 
status signals). The module must also be able to recognize and generate addresses 
associated with the devices it controls. Each I/O module has a unique address or, if 
it controls more than one external device, a unique set of addresses. Finally, the I/O 
module contains logic specific to the interface with each device that it controls. 

An I/O module functions to allow the processor to view a wide range of devices 
in a simple-minded way. There is a spectrum of capabilities that may be provided. The 
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Figure 7.3. Block Diagram of an I/O Module 


I/O module may hide the details of timing, formats, and the electromechanics of an 
external device so that the processor can function in terms of simple read and write 
commands, and possibly open and close file commands. In its simplest form, the I/O 
module may still leave much of the work of controlling a device (e.g., rewind a tape) 
visible to the processor. 

An I/O module that takes on most of the detailed processing burden, present- 
ing a high-level interface to the processor, is usually referred to as an I/O channel or 
I/O processor. An I/O module that is quite primitive and requires detailed control is 
usually referred to as an I/O controller or device controller. V/O controllers are com- 
monly seen on microcomputers, whereas I/O channels are used on mainframes. 

In what follows, we will use the generic term //O module when no confusion 
results and will use more specific terms where necessary. 


7.3 PROGRAMMED I/O 


Three techniques are possible for I/O operations. With programmed I/O, data are 
exchanged between the processor and the I/O module. The processor executes a 
program that gives it direct control of the I/O operation, including sensing device 
status, sending a read or write command, and transferring the data. When the 
processor issues a command to the I/O module, it must wait until the I/O operation 
is complete. If the processor is faster than the I/O module, this is wasteful of proces- 
sor time. With interrupt-driven I/O, the processor issues an I/O command, continues 
to execute other instructions, and is interrupted by the I/O module when the latter 
has completed its work. With both programmed and interrupt I/O, the processor is 
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Table 7.1 I/O Techniques 


No Interrupts Use of Interrupts 


1/O-to-memory transfer through processor | Programmed I/O Interrupt-driven I/O 


Direct I/O-to-memory transfer Direct memory access (DMA) 


responsible for extracting data from main memory for output and storing data in 
main memory for input. The alternative is known as direct memory access (DMA). 
In this mode, the I/O module and main memory exchange data directly, without 
processor involvement. 

Table 7.1 indicates the relationship among these three techniques. In this sec- 
tion, we explore programmed I/O. Interrupt I/O and DMA are explored in the fol- 
lowing two sections, respectively. 


Overview of Programmed I/O 


When the processor is executing a program and encounters an instruction relating 
to I/O, it executes that instruction by issuing a command to the appropriate I/O 
module. With programmed I/O, the I/O module will perform the requested action 
and then set the appropriate bits in the I/O status register (Figure 7.3). The I/O mod- 
ule takes no further action to alert the processor. In particular, it does not interrupt 
the processor. Thus, it is the responsibility of the processor periodically to check the 
status of the I/O module until it finds that the operation is complete. 

To explain the programmed I/O technique, we view it first from the point of 
view of the I/O commands issued by the processor to the I/O module, and then from 
the point of view of the I/O instructions executed by the processor. 


I/O Commands 


To execute an I/O-related instruction, the processor issues an address, specifying 
the particular I/O module and external device, and an I/O command. There are 
four types of I/O commands that an I/O module may receive when it is addressed 
by a processor: 


e Control: Used to activate a peripheral and tell it what to do. For example, a 
magnetic-tape unit may be instructed to rewind or to move forward one record. 
These commands are tailored to the particular type of peripheral device. 


e Test: Used to test various status conditions associated with an I/O module and 
its peripherals. The processor will want to know that the peripheral of interest 
is powered on and available for use. It will also want to know if the most recent 
I/O operation is completed and if any errors occurred. 


e Read: Causes the I/O module to obtain an item of data from the peripheral 
and place it in an internal buffer (depicted as a data register in Figure 7.3). The 
processor can then obtain the data item by requesting that the I/O module 
place it on the data bus. 


e Write: Causes the I/O module to take an item of data (byte or word) from the 
data bus and subsequently transmit that data item to the peripheral. 
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Figure 7.4 Three Techniques for Input of a Block of Data 


Figure 7.4a gives an example of the use of programmed I/O to read in a block of 
data from a peripheral device (e.g.,a record from tape) into memory. Data are read in 
one word (e.g., 16 bits) at a time. For each word that is read in, the processor must re- 
main in a status-checking cycle until it determines that the word is available in the I/O 
module’s data register. This flowchart highlights the main disadvantage of this tech- 
nique: it is a time-consuming process that keeps the processor busy needlessly. 


I/O Instructions 


With programmed I/O, there is a close correspondence between the I/O-related in- 
structions that the processor fetches from memory and the I/O commands that the 
processor issues to an I/O module to execute the instructions. That is, the instruc- 
tions are easily mapped into I/O commands, and there is often a simple one-to-one 
relationship. The form of the instruction depends on the way in which external de- 
vices are addressed. 

Typically, there will be many I/O devices connected through I/O modules to 
the system. Each device is given a unique identifier or address. When the processor 
issues an I/O command, the command contains the address of the desired device. 
Thus, each I/O module must interpret the address lines to determine if the com- 
mand is for itself. 
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When the processor, main memory, and I/O share a common bus, two modes 
of addressing are possible: memory mapped and isolated. With memory-mapped 
I/O, there is a single address space for memory locations and I/O devices. The 
processor treats the status and data registers of I/O modules as memory locations 
and uses the same machine instructions to access both memory and I/O devices. So, 
for example, with 10 address lines, a combined total of 2! = 1024 memory locations 
and I/O addresses can be supported, in any combination. 

With memory-mapped I/O, a single read line and a single write line are needed 
on the bus. Alternatively, the bus may be equipped with memory read and write plus 
input and output command lines. Now, the command line specifies whether the ad- 
dress refers to a memory location or an I/O device. The full range of addresses may 
be available for both. Again, with 10 address lines, the system may now support both 
1024 memory locations and 1024 I/O addresses. Because the address space for I/O is 
isolated from that for memory, this is referred to as isolated I/O. 

Figure 7.5 contrasts these two programmed I/O techniques. Figure 7.5a shows 
how the interface for a simple input device such as a terminal keyboard might ap- 
pear to a programmer using memory-mapped I/O. Assume a 10-bit address, with a 
512-bit memory (locations 0-511) and up to 512 I/O addresses (locations 512-1023). 
Two addresses are dedicated to keyboard input from a particular terminal. Address 
516 refers to the data register and address 517 refers to the status register, which 
also functions as a control register for receiving processor commands. The program 
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Figure 7.5 Memory-Mapped and Isolated I/O 
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shown will read 1 byte of data from the keyboard into an accumulator register in the 
processor. Note that the processor loops until the data byte is available. 

With isolated I/O (Figure 7.5b), the I/O ports are accessible only by special I/O 
commands, which activate the I/O command lines on the bus. 

For most types of processors, there is a relatively large set of different instruc- 
tions for referencing memory. If isolated I/O is used, there are only a few I/O in- 
structions. Thus, an advantage of memory-mapped I/O is that this large repertoire of 
instructions can be used, allowing more efficient programming. A disadvantage is 
that valuable memory address space is used up. Both memory-mapped and isolated 
1/O are in common use. 


7.4 INTERRUPT-DRIVEN I/O 


The problem with programmed I/O is that the processor has to wait a long time for the 
I/O module of concern to be ready for either reception or transmission of data. The 
processor, while waiting, must repeatedly interrogate the status of the I/O module. As 
a result, the level of the performance of the entire system is severely degraded. 

An alternative is for the processor to issue an I/O command to a module and then 
go on to do some other useful work. The I/O module will then interrupt the processor 
to request service when it is ready to exchange data with the processor. The processor 
then executes the data transfer, as before, and then resumes its former processing. 

Let us consider how this works, first from the point of view of the I/O module. 
For input, the I/O module receives a READ command from the processor. The I/O 
module then proceeds to read data in from an associated peripheral. Once the data 
are in the module’s data register, the module signals an interrupt to the processor 
over a control line. The module then waits until its data are requested by the proces- 
sor. When the request is made, the module places its data on the data bus and is then 
ready for another I/O operation. 

From the processor’s point of view, the action for input is as follows. The 
processor issues a READ command. It then goes off and does something else (e.g., 
the processor may be working on several different programs at the same time). At 
the end of each instruction cycle, the processor checks for interrupts (Figure 3.9). 
When the interrupt from the I/O module occurs, the processor saves the context 
(e.g., program counter and processor registers) of the current program and 
processes the interrupt. In this case, the processor reads the word of data from the 
I/O module and stores it in memory. It then restores the context of the program it 
was working on (or some other program) and resumes execution. 

Figure 7.4b shows the use of interrupt I/O for reading in a block of data. Com- 
pare this with Figure 7.4a. Interrupt I/O is more efficient than programmed I/O be- 
cause it eliminates needless waiting. However, interrupt I/O still consumes a lot of 
processor time, because every word of data that goes from memory to I/O module 
or from I/O module to memory must pass through the processor. 


Interrupt Processing 


Let us consider the role of the processor in interrupt-driven I/O in more detail. The 
occurrence of an interrupt triggers a number of events, both in the processor hardware 
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Figure 7.6 Simple Interrupt Processing 


and in software. Figure 7.6 shows a typical sequence. When an I/O device completes 
an I/O operation, the following sequence of hardware events occurs: 


1. The device issues an interrupt signal to the processor. 


2. The processor finishes execution of the current instruction before responding 
to the interrupt, as indicated in Figure 3.9. 


3. The processor tests for an interrupt, determines that there is one, and sends an 
acknowledgment signal to the device that issued the interrupt. The acknowl- 
edgment allows the device to remove its interrupt signal. 


4. The processor now needs to prepare to transfer control to the interrupt routine. 
To begin, it needs to save information needed to resume the current program at 
the point of interrupt. The minimum information required is (a) the status of the 
processor, which is contained in a register called the program status word (PSW), 
and (b) the location of the next instruction to be executed, which is contained in 
the program counter. These can be pushed onto the system control stack.” 


un 


The processor now loads the program counter with the entry location of the 
interrupt-handling program that will respond to this interrupt. Depending on 


See Appendix 10A for a discussion of stack operation. 


230 CHAPTER 7 / INPUT/OUTPUT 


the computer architecture and operating system design, there may be a single 
program; one program for each type of interrupt; or one program for each de- 
vice and each type of interrupt. If there is more than one interrupt-handling 
routine, the processor must determine which one to invoke. This information 
may have been included in the original interrupt signal, or the processor may 
have to issue a request to the device that issued the interrupt to get a response 
that contains the needed information. 


Once the program counter has been loaded, the processor proceeds to the 
next instruction cycle, which begins with an instruction fetch. Because the instruc- 
tion fetch is determined by the contents of the program counter, the result is that 
control is transferred to the interrupt-handler program. The execution of this pro- 
gram results in the following operations: 


6. At this point, the program counter and PSW relating to the interrupted program 
have been saved on the system stack. However, there is other information that is 
considered part of the “state” of the executing program. In particular, the contents 
of the processor registers need to be saved, because these registers may be used 
by the interrupt handler. So, all of these values, plus any other state information, 
need to be saved. Typically, the interrupt handler will begin by saving the contents 
of all registers on the stack. Figure 7.7a shows a simple example. In this case, a user 
program is interrupted after the instruction at location N. The contents of all of 
the registers plus the address of the next instruction (N + 1) are pushed onto the 
stack. The stack pointer is updated to point to the new top of stack, and the pro- 
gram counter is updated to point to the beginning of the interrupt service routine. 


7. The interrupt handler next processes the interrupt. This includes an examina- 
tion of status information relating to the I/O operation or other event that 
caused an interrupt. It may also involve sending additional commands or ac- 
knowledgments to the I/O device. 


8. When interrupt processing is complete, the saved register values are retrieved 
from the stack and restored to the registers (e.g., see Figure 7.7b). 


9. The final act is to restore the PSW and program counter values from the stack. 
As a result, the next instruction to be executed will be from the previously in- 
terrupted program. 


Note that it is important to save all the state information about the interrupted 
program for later resumption. This is because the interrupt is not a routine called 
from the program. Rather, the interrupt can occur at any time and therefore at any 
point in the execution of a user program. Its occurrence is unpredictable. Indeed, as 
we will see in the next chapter, the two programs may not have anything in common 
and may belong to two different users. 


Design Issues 


Two design issues arise in implementing interrupt I/O. First, because there will al- 
most invariably be multiple I/O modules, how does the processor determine which 
device issued the interrupt? And second, if multiple interrupts have occurred, how 
does the processor decide which one to process? 
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Figure 7.7 Changes in Memory and Registers for an Interrupt 


Let us consider device identification first. Four general categories of tech- 
niques are in common use: 


e Multiple interrupt lines 


e Software poll 


e Daisy chain (hardware poll, vectored) 


e Bus arbitration (vectored) 


The most straightforward approach to the problem is to provide multiple inter- 
rupt lines between the processor and the I/O modules. However, it is impractical to 
dedicate more than a few bus lines or processor pins to interrupt lines. Consequently, 
even if multiple lines are used, it is likely that each line will have multiple I/O mod- 
ules attached to it. Thus, one of the other three techniques must be used on each line. 
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One alternative is the software poll. When the processor detects an interrupt, 
it branches to an interrupt-service routine whose job it is to poll each I/O module to 
determine which module caused the interrupt. The poll could be in the form of a 
separate command line (e.g., TESTI/O). In this case, the processor raises TESTI/O 
and places the address of a particular I/O module on the address lines. The I/O mod- 
ule responds positively if it set the interrupt. Alternatively, each I/O module could 
contain an addressable status register. The processor then reads the status register 
of each I/O module to identify the interrupting module. Once the correct module is 
identified, the processor branches to a device-service routine specific to that device. 

The disadvantage of the software poll is that it is ttme consuming. A more effi- 
cient technique is to use a daisy chain, which provides, in effect, a hardware poll. An 
example of a daisy-chain configuration is shown in Figure 3.26. For interrupts, all 
1/O modules share a common interrupt request line. The interrupt acknowledge line 
is daisy chained through the modules. When the processor senses an interrupt, it 
sends out an interrupt acknowledge. This signal propagates through a series of I/O 
modules until it gets to a requesting module. The requesting module typically re- 
sponds by placing a word on the data lines. This word is referred to as a vector and is 
either the address of the I/O module or some other unique identifier. In either case, 
the processor uses the vector as a pointer to the appropriate device-service routine. 
This avoids the need to execute a general interrupt-service routine first. This tech- 
nique is called a vectored interrupt. 

There is another technique that makes use of vectored interrupts, and that is 
bus arbitration. With bus arbitration, an I/O module must first gain control of the 
bus before it can raise the interrupt request line. Thus, only one module can raise the 
line at a time. When the processor detects the interrupt, it responds on the interrupt 
acknowledge line. The requesting module then places its vector on the data lines. 

The aforementioned techniques serve to identify the requesting I/O module. 
They also provide a way of assigning priorities when more than one device is re- 
questing interrupt service. With multiple lines, the processor just picks the interrupt 
line with the highest priority. With software polling, the order in which modules are 
polled determines their priority. Similarly, the order of modules on a daisy chain de- 
termines their priority. Finally, bus arbitration can employ a priority scheme, as dis- 
cussed in Section 3.4. 

We now turn to two examples of interrupt structures. 


Intel 82C59A Interrupt Controller 


The Intel 80386 provides a single Interrupt Request (INTR) and a single Interrupt 
Acknowledge (INTA) line. To allow the 80386 to handle a variety of devices and pri- 
ority structures, it is usually configured with an external interrupt arbiter, the 82C59A. 
External devices are connected to the 82C59A, which in turn connects to the 80386. 

Figure 7.8 shows the use of the 82C59A to connect multiple I/O modules for the 
80386. A single 82C59A can handle up to eight modules. If control for more than eight 
modules is required, a cascade arrangement can be used to handle up to 64 modules. 

The 82C59A’s sole responsibility is the management of interrupts. It accepts 
interrupt requests from attached modules, determines which interrupt has the high- 
est priority, and then signals the processor by raising the INTR line. The processor 
acknowledges via the INTA line. This prompts the 82C59A to place the appropriate 
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Figure 7.8 Use of the 82C59A Interrupt Controller 
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vector information on the data bus. The processor can then proceed to process the 

interrupt and to communicate directly with the I/O module to read or write data. 
The 82C59A is programmable. The 80386 determines the priority scheme to 

be used by setting a control word in the 82C59A. The following interrupt modes are 


possible: 


e Fully nested: The interrupt requests are ordered in priority from 0 (IRO) 


through 7 (IR7) 
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e Rotating: In some applications a number of interrupting devices are of equal 
priority. In this mode a device, after being serviced, receives the lowest priority 
in the group. 


e Special mask: This allows the processor to inhibit interrupts from certain devices. 


The Intel 82C55A Programmable Peripheral Interface 


As an example of an I/O module used for programmed I/O and interrupt-driven 
1/O, we consider the Intel 82C55A Programmable Peripheral Interface. The 82C55A 
is a single-chip, general-purpose I/O module designed for use with the Intel 80386 
processor. Figure 7.9 shows a general block diagram plus the pin assignment for the 
40-pin package in which it is housed. 

The right side of the block diagram is the external interface of the 82C55A. 
The 24 I/O lines are programmable by the 80386 by means of the control register. 
The 80386 can set the value of the control register to specify a variety of operating 
modes and configurations. The 24 lines are divided into three 8-bit groups (A, B, C). 
Each group can function as an 8-bit I/O port. In addition, group C is subdivided into 
4-bit groups (C4 and Cg), which may be used in conjunction with the A and B I/O 
ports. Configured in this manner, group C lines carry control and status signals. 

The left side of the block diagram is the internal interface to the 80386 bus. It 
includes an 8-bit bidirectional data bus (DO through D7), used to transfer data to 
and from the I/O ports and to transfer control information to the control register. 
The two address lines specify one of the three I/O ports or the control register. A 
transfer takes place when the CHIP SELECT line is enabled together with either 
the READ or WRITE line. The RESET line is used to initialize the module. 
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Figure 7.9 The Intel 82C55A Programmable Peripheral Interface 
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The control register is loaded by the processor to control the mode of opera- 
tion and to define signals, if any. In Mode 0 operation, the three groups of eight ex- 
ternal lines function as three 8-bit I/O ports. Each port can be designated as input or 
output. Otherwise, groups A and B function as I/O ports, and the lines of group C 
serve as control lines for A and B. The control signals serve two principal purposes: 
“handshaking” and interrupt request. Handshaking is a simple timing mechanism. 
One control line is used by the sender as a DATA READY line, to indicate when 
the data are present on the I/O data lines. Another line is used by the receiver as an 
ACKNOWLEDGE, indicating that the data have been read and the data lines may 
be cleared. Another line may be designated as an INTERRUPT REQUEST line 
and tied back to the system bus. 

Because the 82C55A is programmable via the control register, it can be used 
to control a variety of simple peripheral devices. Figure 7.10 illustrates its use to 
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control a keyboard/display terminal. The keyboard provides 8 bits of input. Two of 
these bits, SHIFT and CONTROL, have special meaning to the keyboard-handling 
program executing in the processor. However, this interpretation is transparent to 
the 82C55A, which simply accepts the 8 bits of data and presents them on the system 
data bus. Two handshaking control lines are provided for use with the keyboard. 

The display is also linked by an 8-bit data port. Again, two of the bits have spe- 
cial meanings that are transparent to the 82CS55A. In addition to two handshaking 
lines, two lines provide additional control functions. 


7.5 DIRECT MEMORY ACCESS 


Drawbacks of Programmed and Interrupt-Driven I/O 


Interrupt-driven I/O, though more efficient than simple programmed I/O, still re- 
quires the active intervention of the processor to transfer data between memory and 
an I/O module, and any data transfer must traverse a path through the processor. 
Thus, both these forms of I/O suffer from two inherent drawbacks: 


1. The I/O transfer rate is limited by the speed with which the processor can test 
and service a device. 


2. The processor is tied up in managing an I/O transfer; a number of instructions 
must be executed for each I/O transfer (e.g., Figure 7.5). 


There is somewhat of a trade-off between these two drawbacks. Consider the 
transfer of a block of data. Using simple programmed I/O, the processor is dedicated 
to the task of I/O and can move data at a rather high rate, at the cost of doing noth- 
ing else. Interrupt I/O frees up the processor to some extent at the expense of the 
I/O transfer rate. Nevertheless, both methods have an adverse impact on both 
processor activity and I/O transfer rate. 

When large volumes of data are to be moved, a more efficient technique is re- 
quired: direct memory access (DMA). 


DMA Function 


DMA involves an additional module on the system bus. The DMA module 
(Figure 7.11) is capable of mimicking the processor and, indeed, of taking over con- 
trol of the system from the processor. It needs to do this to transfer data to and from 
memory over the system bus. For this purpose, the DMA module must use the bus 
only when the processor does not need it, or it must force the processor to suspend 
operation temporarily. The latter technique is more common and is referred to as 
cycle stealing, because the DMA module in effect steals a bus cycle. 

When the processor wishes to read or write a block of data, it issues a command 
to the DMA module, by sending to the DMA module the following information: 


e Whether a read or write is requested, using the read or write control line be- 
tween the processor and the DMA module 
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Figure 7.11 Typical DMA Block Diagram 


e The address of the I/O device involved, communicated on the data lines 


e The starting location in memory to read from or write to, communicated on 
the data lines and stored by the DMA module in its address register 


e The number of words to be read or written, again communicated via the data 
lines and stored in the data count register 


The processor then continues with other work. It has delegated this I/O oper- 
ation to the DMA module. The DMA module transfers the entire block of data, one 
word at a time, directly to or from memory, without going through the processor. 
When the transfer is complete, the DMA module sends an interrupt signal to the 
processor. Thus, the processor is involved only at the beginning and end of the trans- 
fer (Figure 7.4c). 

Figure 7.12 shows where in the instruction cycle the processor may be sus- 
pended. In each case, the processor is suspended just before it needs to use the 
bus. The DMA module then transfers one word and returns control to the 
processor. Note that this is not an interrupt; the processor does not save a con- 
text and do something else. Rather, the processor pauses for one bus cycle. The 
overall effect is to cause the processor to execute more slowly. Nevertheless, for 
a multiple-word I/O transfer, DMA is far more efficient than interrupt-driven or 
programmed I/O. 

The DMA mechanism can be configured in a variety of ways. Some possi- 
bilities are shown in Figure 7.13. In the first example, all modules share the same 
system bus. The DMA module, acting as a surrogate processor, uses programmed 
1/O to exchange data between memory and an I/O module through the DMA 
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Figure 7.12 DMA and Interrupt Breakpoints during an Instruction Cycle 
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Figure 7.14 8237 DMA Usage of System Bus 


module. This configuration, while it may be inexpensive, is clearly inefficient. As 
with processor-controlled programmed I/O, each transfer of a word consumes 
two bus cycles. 

The number of required bus cycles can be cut substantially by integrating the 
DMA and I/O functions. As Figure 7.13b indicates, this means that there is a path 
between the DMA module and one or more I/O modules that does not include 
the system bus. The DMA logic may actually be a part of an I/O module, or it may 
be a separate module that controls one or more I/O modules. This concept can be 
taken one step further by connecting I/O modules to the DMA module using an 
I/O bus (Figure 7.13c). This reduces the number of I/O interfaces in the DMA 
module to one and provides for an easily expandable configuration. In both of 
these cases (Figures 7.13b and c), the system bus that the DMA module shares 
with the processor and memory is used by the DMA module only to exchange 
data with memory. The exchange of data between the DMA and I/O modules 
takes place off the system bus. 


Intel 8237A DMA Controller 


The Intel 8237A DMA controller interfaces to the 80x86 family of processors and to 
DRAM memory to provide a DMA capability. Figure 7.14 indicates the location of 
the DMA module. When the DMA module needs to use the system buses (data, ad- 
dress, and control) to transfer data, it sends a signal called HOLD to the processor. 
The processor responds with the HLDA (hold acknowledge) signal, indicating that 
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the DMA module can use the buses. For example, if the DMA module is to transfer 
a block of data from memory to disk, it will do the following: 


1. The peripheral device (such as the disk controller) will request the service of 
DMA by pulling DREQ (DMA request) high. 


2. The DMA will put a high on its HRO (hold request), signaling the CPU 
through its HOLD pin that it needs to use the buses. 


3. The CPU will finish the present bus cycle (not necessarily the present instruc- 
tion) and respond to the DMA request by putting high on its HDLA (hold ac- 
knowledge), thus telling the 8237 DMA that it can go ahead and use the buses 
to perform its task. HOLD must remain active high as long as DMA is per- 
forming its task. 


4. DMA will activate DACK (DMA acknowledge), which tells the peripheral de- 
vice that it will start to transfer the data. 


5. DMA starts to transfer the data from memory to peripheral by putting the ad- 
dress of the first byte of the block on the address bus and activating MEMR, 
thereby reading the byte from memory into the data bus; it then activates I|OW 
to write it to the peripheral. Then DMA decrements the counter and incre- 
ments the address pointer and repeats this process until the count reaches zero 
and the task is finished. 


6. After the DMA has finished its job it will deactivate HRQ, signaling the CPU 
that it can regain control over its buses. 


While the DMA is using the buses to transfer data, the processor is idle. Simi- 
larly, when the processor is using the bus, the DMA is idle. The 8237 DMA is known 
as a fly-by DMA controller. This means that the data being moved from one loca- 
tion to another does not pass through the DMA chip and is not stored in the DMA 
chip. Therefore, the DMA can only transfer data between an I/O port and a memory 
address, but not between two I/O ports or two memory locations. However, as ex- 
plained subsequently, the DMA chip can perform a memory-to-memory transfer via 
a register. 

The 8237 contains four DMA channels that can be programmed indepen- 
dently, and any one of the channels may be active at any moment. These channels 
are numbered 0, 1,2, and 3. 

The 8237 has a set of five control/command registers to program and control 
DMA operation over one of its channels (Table 7.2): 


e Command: The processor loads this register to control the operation of the 
DMA. D0 enables a memory-to-memory transfer, in which channel 0 is used to 
transfer a byte into an 8237 temporary register and channel 1 is used to transfer 
the byte from the register to memory. When memory-to-memory is enabled, D1 
can be used to disable increment/decrement on channel 0 so that a fixed value 
can be written into a block of memory. D2 enables or disables DMA. 


Status: The processor reads this register to determine DMA status. Bits DO-D3 
are used to indicate if channels 0-3 have reached their TC (terminal count). 
Bits D4-D7 are used by the processor to determine if any channel has a DMA 
request pending. 


Tve 


Table 7.2 


Intel 8237A Registers 


Command 


Status 


Mode 


Single Mask 


All Mask 


Memory-to-memory E/D 


Channel 0 address 
hold E/D 


Channel 0 has reached TC 
Channel 1 has reached TC 


Channel select 


Select channel mask bit 


Clear/set channel 0 mask bit 


Clear/set channel 1 mask bit 


Controller E/D 


Normal/compressed timing 


Channel 2 has reached TC 
Channel 3 has reached TC 


Verify/write/ read transfer 


Fixed/rotating priority 


Channel 0 request 


Auto-initialization E/D 


Late/extended write 
selection 


Channel 0 request 


Address increment/ 
decrement select 


DREQ sense active 
high/low 


DACK sense active 
high/low 


enable/disable 


= terminal count 


Channel 0 request 


Channel 0 request 


Demand/single/block/ 
cascade mode select 


Clear/set mask bit 


Not used 


Clear/set channel 2 mask bit 


Clear/set channel 3 mask bit 


Not used 
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Mode: The processor sets this register to determine the mode of operation of 
the DMA. Bits DO and D1 are used to select a channel. The other bits select 
various operation modes for the selected channel. Bits D2 and D3 determine if 
the transfer is a from an I/O device to memory (write) or from memory to I/O 
(read), or a verify operation. If D4 is set, then the memory address register and 
the count register are reloaded with their original values at the end of a DMA 
data transfer. Bits D6 and D7 determine the way in which the 8237 is used. In 
single mode, a single byte of data is transferred. Block and demand modes are 
used for a block transfer, with the demand mode allowing for premature end- 
ing of the transfer. Cascade mode allows multiple 8237s to be cascaded to ex- 
pand the number of channels to more than 4. 


Single Mask: The processor sets this register. Bits DO and D1 select the chan- 
nel. Bit D2 clears or sets the mask bit for that channel. It is through this regis- 
ter that the DREQ input of a specific channel can be masked (disabled) or 
unmasked (enabled). While the command register can be used to disable the 
whole DMA chip, the single mask register allows the programmer to disable 
or enable a specific channel. 


All Mask: This register is similar to the single mask register except that all four 
channels can be masked or unmasked with one write operation. 


In addition, the 8237A has eight data registers: one memory address register 


and one count register for each channel. The processor sets these registers to indi- 
cate the location of size of main memory to be affected by the transfers. 


7.6 I/O CHANNELS AND PROCESSORS 


The Evolution of the I/O Function 


As computer systems have evolved, there has been a pattern of increasing complex- 
ity and sophistication of individual components. Nowhere is this more evident than 
in the I/O function. We have already seen part of that evolution. The evolutionary 
steps can be summarized as follows: 


1. 


2 


The CPU directly controls a peripheral device. This is seen in simple micro- 
processor-controlled devices. 

A controller or I/O module is added. The CPU uses programmed I/O without 
interrupts. With this step, the CPU becomes somewhat divorced from the spe- 
cific details of external device interfaces. 


. The same configuration as in step 2 is used, but now interrupts are employed. 


The CPU need not spend time waiting for an I/O operation to be performed, 
thus increasing efficiency. 


. The I/O module is given direct access to memory via DMA. It can now move a 


block of data to or from memory without involving the CPU, except at the 
beginning and end of the transfer. 
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5. The I/O module is enhanced to become a processor in its own right, with a spe- 
cialized instruction set tailored for I/O. The CPU directs the I/O processor to 
execute an I/O program in memory. The I/O processor fetches and executes 
these instructions without CPU intervention. This allows the CPU to specify a 
sequence of I/O activities and to be interrupted only when the entire sequence 
has been performed. 


6. The I/O module has a local memory of its own and is, in fact, a computer 
in its own right. With this architecture, a large set of I/O devices can be 
controlled, with minimal CPU involvement. A common use for such an 
architecture has been to control communication with interactive terminals. 
The I/O processor takes care of most of the tasks involved in controlling 
the terminals. 


As one proceeds along this evolutionary path, more and more of the I/O 
function is performed without CPU involvement. The CPU is increasingly re- 
lieved of I/O-related tasks, improving performance. With the last two steps (5-6), 
a major change occurs with the introduction of the concept of an I/O module ca- 
pable of executing a program. For step 5, the I/O module is often referred to as an 
I/O channel. For step 6, the term I/O processor is often used. However, both 
terms are on occasion applied to both situations. In what follows, we will use the 
term I/O channel. 


Characteristics of I/O Channels 


The I/O channel represents an extension of the DMA concept. An I/O channel has 
the ability to execute I/O instructions, which gives it complete control over I/O 
operations. In a computer system with such devices, the CPU does not execute I/O 
instructions. Such instructions are stored in main memory to be executed by a 
special-purpose processor in the I/O channel itself. Thus, the CPU initiates an I/O 
transfer by instructing the I/O channel to execute a program in memory. The pro- 
gram will specify the device or devices, the area or areas of memory for storage, pri- 
ority, and actions to be taken for certain error conditions. The I/O channel follows 
these instructions and controls the data transfer. 

Two types of I/O channels are common, as illustrated in Figure 7.15. A 
selector channel controls multiple high-speed devices and, at any one time, is ded- 
icated to the transfer of data with one of those devices. Thus, the I/O channel 
selects one device and effects the data transfer. Each device, or a small set of de- 
vices, is handled by a controller, or I/O module, that is much like the I/O modules 
we have been discussing. Thus, the I/O channel serves in place of the CPU in con- 
trolling these I/O controllers. A multiplexor channel can handle I/O with multiple 
devices at the same time. For low-speed devices, a byte multiplexor accepts or 
transmits characters as fast as possible to multiple devices. For example, the resul- 
tant character stream from three devices with different rates and individual streams 
A A,A3A4 sey B,B.B3By, ays and Cy C,C3C, eee might be A,B,C, A,C,A3B,C3Ag, 
and so on. For high-speed devices, a block multiplexor interleaves blocks of data 
from several devices. 
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Figure 7.15 I/O Channel Architecture 


7.7 THE EXTERNAL INTERFACE: FIREWIRE AND INFINIBAND 


Types of Interfaces 


The interface to a peripheral from an I/O module must be tailored to the nature and 
operation of the peripheral. One major characteristic of the interface is whether it is 
serial or parallel (Figure 7.16). In a parallel interface, there are multiple lines con- 
necting the I/O module and the peripheral, and multiple bits are transferred simul- 
taneously, just as all of the bits of a word are transferred simultaneously over the 
data bus. In a serial interface, there is only one line used to transmit data, and bits 
must be transmitted one at a time. A parallel interface has traditionally been used 
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Figure 7.16 Parallel and Serial I/O 


for higher-speed peripherals, such as tape and disk, while the serial interface has tra- 
ditionally been used for printers and terminals. With a new generation of high-speed 
serial interfaces, parallel interfaces are becoming much less common. 

In either case, the I/O module must engage in a dialogue with the peripheral. 
In general terms, the dialogue for a write operation is as follows: 


1. The I/O module sends a control signal requesting permission to send data. 
The peripheral acknowledges the request. 


Ye N 


The I/O module transfers data (one word or a block depending on the periph- 
eral). 


4. The peripheral acknowledges receipt of the data. 


A read operation proceeds similarly. 

Key to the operation of an I/O module is an internal buffer that can store data 
being passed between the peripheral and the rest of the system. This buffer allows 
the I/O module to compensate for the differences in speed between the system bus 
and its external lines. 


Point-to-Point and Multipoint Configurations 


The connection between an I/O module in a computer system and external devices 
can be either point-to-point or multipoint. A point-to-point interface provides a 
dedicated line between the I/O module and the external device. On small systems 
(PCs, workstations), typical point-to-point links include those to the keyboard, 
printer, and external modem. A typical example of such an interface is the E[A-232 
specification (see [STALO07] for a description). 

Of increasing importance are multipoint external interfaces, used to sup- 
port external mass storage devices (disk and tape drives) and multimedia devices 
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(CD-ROMs, video, audio). These multipoint interfaces are in effect external buses, 
and they exhibit the same type of logic as the buses discussed in Chapter 3. In this 
section, we look at two key examples: FireWire and Infiniband. 


FireWire Serial Bus 


With processor speeds reaching gigahertz range and storage devices holding multi- 
ple gigabits, the I/O demands for personal computers, workstations, and servers are 
formidable. Yet the high-speed I/O channel technologies that have been developed 
for mainframe and supercomputer systems are too expensive and bulky for use on 
these smaller systems. Accordingly, there has been great interest in developing a 
high-speed alternative to Small Computer System Interface (SCSI) and other small- 
system I/O interfaces. The result is the IEEE standard 1394, for a High Performance 
Serial Bus, commonly known as FireWire. 

FireWire has a number of advantages over older I/O interfaces. It is very 
high speed, low cost, and easy to implement. In fact, FireWire is finding favor not 
only for computer systems, but also in consumer electronics products, such as dig- 
ital cameras, DVD players/recorders, and televisions. In these products, Fire Wire 
is used to transport video images, which are increasingly coming from digi- 
tized sources. 

One of the strengths of the FireWire interface is that it uses serial transmission 
(bit at a time) rather than parallel. Parallel interfaces, such as SCSI, require more 
wires, which means wider, more expensive cables and wider, more expensive con- 
nectors with more pins to bend or break. A cable with more wires requires shielding 
to prevent electrical interference between the wires. Also, with a parallel interface, 
synchronization between wires becomes a requirement, a problem that gets worse 
with increased cable length. 

In addition, computers are getting physically smaller even as they expand in 
computing power and I/O needs. Handheld and pocket-size computers have little 
room for connectors yet need high data rates to handle images and video. 

The intent of FireWire is to provide a single I/O interface with a simple con- 
nector that can handle numerous devices through a single port, so that the mouse, 
laser printer, external disk drive, sound, and local area network hookups can be re- 
placed with this single connector. 


FIREWIRE CONFIGURATIONS FireWire uses a daisy-chain configuration, with up 
to 63 devices connected off a single port. Moreover, up to 1022 FireWire buses can 
be interconnected using bridges, enabling a system to support as many peripherals 
as required. 

FireWire provides for what is known as hot plugging, which makes it possible 
to connect and disconnect peripherals without having to power the computer sys- 
tem down or reconfigure the system. Also, FireWire provides for automatic configu- 
ration; it is not necessary manually to set device IDs or to be concerned with the 
relative position of devices. Figure 7.17 shows a simple FireWire configuration. With 
FireWire, there are no terminations, and the system automatically performs a con- 
figuration function to assign addresses. Also note that a FireWire bus need not be a 
strict daisy chain. Rather, a tree-structured configuration is possible. 
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Figure 7.17 Simple FireWire Configuration 
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An important feature of the FireWire standard is that it specifies a set of three 
layers of protocols to standardize the way in which the host system interacts with 
the peripheral devices over the serial bus. Figure 7.18 illustrates this stack. The three 
layers of the stack are as follows: 


e Physical layer: Defines the transmission media that are permissible under 
Fire Wire and the electrical and signaling characteristics of each 


Transaction layer 


|<<} 
(read, write, lock) 
Asynchronous Isochronous 
Link layer 
—_—_—_— 


Packet transmitter Packet receiver Cycle control 


Serial bus management 


Physical layer 
Arbitration Data resynch Encode/decode 
Connectors/media Connection state Signal levels 


Figure 7.18 FireWire Protocol Stack 
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e Link layer: Describes the transmission of data in the packets 


e Transaction layer: Defines a request-response protocol that hides the lower- 
layer details of FireWire from applications 


PHYSICAL LAYER The physical layer of FireWire specifies several alternative trans- 
mission media and their connectors, with different physical and data transmission 
properties. Data rates from 25 to 3200 Mbps are defined. The physical layer con- 
verts binary data into electrical signals for various physical media. This layer also 
provides the arbitration service that guarantees that only one device at a time will 
transmit data. 

Two forms of arbitration are provided by FireWire. The simplest form is based 
on the tree-structured arrangement of the nodes on a FireWire bus, mentioned ear- 
lier. A special case of this structure is a linear daisy chain. The physical layer con- 
tains logic that allows all the attached devices to configure themselves so that one 
node is designated as the root of the tree and other nodes are organized in a par- 
ent/child relationship forming the tree topology. Once this configuration is estab- 
lished, the root node acts as a central arbiter and processes requests for bus access in 
a first-come-first-served fashion. In the case of simultaneous requests, the node with 
the highest natural priority is granted access. The natural priority is determined by 
which competing node is closest to the root and, among those of equal distance from 
the root, which one has the lower ID number. 

The aforementioned arbitration method is supplemented by two additional 
functions: fairness arbitration and urgent arbitration. With fairness arbitration, time 
on the bus is organized into fairness intervals. At the beginning of an interval, each 
node sets an arbitration_enable flag. During the interval, each node may compete 
for bus access. Once a node has gained access to the bus, it resets its arbitration 
enable flag and may not again compete for fair access during this interval. This 
scheme makes the arbitration fairer, in that it prevents one or more busy high- 
priority devices from monopolizing the bus. 

In addition to the fairness scheme, some devices may be configured as having 
urgent priority. Such nodes may gain control of the bus multiple times during a fair- 
ness interval. In essence, a counter is used at each high-priority node that enables 
the high-priority nodes to control 75% of the available bus time. For each packet 
that is transmitted as nonurgent, three packets may be transmitted as urgent. 


LINK LAYER The link layer defines the transmission of data in the form of packets. 
Two types of transmission are supported: 


e Asynchronous: A variable amount of data and several bytes of transaction 
layer information are transferred as a packet to an explicit address and an ac- 
knowledgment is returned. 


e Isochronous: A variable amount of data is transferred in a sequence of fixed- 
size packets transmitted at regular intervals. This form of transmission uses 
simplified addressing and no acknowledgment. 


Asynchronous transmission is used by data that have no fixed data rate 
requirements. Both the fair arbitration and urgent arbitration schemes may be used 
for asynchronous transmission. The default method is fair arbitration. Devices that 
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Figure 7.19 FireWire Subactions 


desire a substantial fraction of the bus capacity or have severe latency requirements 
use the urgent arbitration method. For example, a high-speed real-time data collection 
node may use urgent arbitration when critical data buffers are more than half full. 


ing a 


Figure 7.19a depicts a typical asynchronous transaction. The process of deliver- 


single packet is called a subaction. The subaction consists of five time periods: 


Arbitration sequence: This is the exchange of signals required to give one 
device control of the bus. 


Packet transmission: Every packet includes a header containing the source 
and destination IDs. The header also contains packet type information, a CRC 
(cyclic redundancy check) checksum, and parameter information for the spe- 
cific packet type. A packet may also include a data block consisting of user 
data and another CRC. 


Acknowledgment gap: This is the time delay for the destination to receive and 
decode a packet and generate an acknowledgment. 


Acknowledgment: The recipient of the packet returns an acknowledgment 
packet with a code indicating the action taken by the recipient. 


Subaction gap: This is an enforced idle period to ensure that other nodes on 
the bus do not begin arbitrating before the acknowledgment packet has been 
transmitted. 


At the time that the acknowledgment is sent, the acknowledging node is in 


control of the bus. Therefore, if the exchange is a request/response interaction be- 
tween two nodes, then the responding node can immediately transmit the response 
packet without going through an arbitration sequence (Figure 7.19b). 
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For devices that regularly generate or consume data, such as digital sound or 
video, isochronous access is provided. This method guarantees that data can be de- 
livered within a specified latency with a guaranteed data rate. 

To accommodate a mixed traffic load of isochronous and asynchronous data 
sources, one node is designated as cycle master. Periodically, the cycle master issues a 
cycle_start packet. This signals all other nodes that an isochronous cycle has begun. 
During this cycle, only isochronous packets may be sent (Figure 7.19c). Each isochro- 
nous data source arbitrates for bus access. The winning node immediately transmits a 
packet. There is no acknowledgment to this packet, and so other isochronous data 
sources immediately arbitrate for the bus after the previous isochronous packet is 
transmitted. The result is that there is a small gap between the transmission of one 
packet and the arbitration period for the next packet, dictated by delays on the bus. 
This delay, referred to as the isochronous gap, is smaller than a subaction gap. 

After all isochronous sources have transmitted, the bus will remain idle long 
enough for a subaction gap to occur. This is the signal to the asynchronous sources 
that they may now compete for bus access. Asynchronous sources may then use the 
bus until the beginning of the next isochronous cycle. 

Isochronous packets are labeled with 8-bit channel numbers that are previ- 
ously assigned by a dialogue between the two nodes that are to exchange isochro- 
nous data. The header, which is shorter than that for asynchronous packets, also 
includes a data length field and a header CRC. 


InfiniBand 


InfiniBand is a recent I/O specification aimed at the high-end server market.* The 
first version of the specification was released in early 2001 and has attracted numer- 
ous vendors. The standard describes an architecture and specifications for data flow 
among processors and intelligent I/O devices. InfiniBand has become a popular in- 
terface for storage area networking and other large storage configurations. In 
essence, InfiniBand enables servers, remote storage, and other network devices to 
be attached in a central fabric of switches and links. The switch-based architecture 
can connect up to 64,000 servers, storage systems, and networking devices. 


INFINIBAND ARCHITECTURE Although PCI is a reliable interconnect method 
and continues to provide increased speeds, up to 4 Gbps, it is a limited architecture 
compared to Infiniband. With InfiniBand, it is not necessary to have the basic I/O 
interface hardware inside the server chassis. With InfiniBand, remote storage, net- 
working, and connections between servers are accomplished by attaching all de- 
vices to a central fabric of switches and links. Removing I/O from the server chassis 
allows greater server density and allows for a more flexible and scalable data cen- 
ter, as independent nodes may be added as needed. 

Unlike PCI, which measures distances from a CPU motherboard in centime- 
ters, InfiniBand’s channel design enables I/O devices to be placed up to 17 meters 
away from the server using copper, up to 300 m using multimode optical fiber, and 


3Infiniband is the result of the merger of two competing projects: Future I/O (backed by Cisco, HP, Com- 
paq, and IBM) and Next Generation I/O (developed by Intel and backed by a number of other companies). 
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Figure 7.20 InfiniBand Switch Fabric 


up to 10 km with single-mode optical fiber. Transmission rates has high as 30 Gbps 
can be achieved. 


Figure 7.20 illustrates the InfiniBand architecture. The key elements are as 


follows: 


Host channel adapter (HCA): Instead of a number of PCI slots, a typical 
server needs a single interface to an HCA that links the server to an Infini- 
Band switch. The HCA attaches to the server at a memory controller, which 
has access to the system bus and controls traffic between the processor and 
memory and between the HCA and memory. The HCA uses direct-memory 
access (DMA) to read and write memory. 


Target channel adapter (TCA): A TCA is used to connect storage systems, 
routers, and other peripheral devices to an InfiniBand switch. 


InfiniBand switch: A switch provides point-to-point physical connections to a 
variety of devices and switches traffic from one link to another. Servers and 
devices communicate through their adapters, via the switch. The switch’s intel- 
ligence manages the linkage without interrupting the servers’ operation. 


Links: The link between a switch and a channel adapter, or between two switches. 


Subnet: A subnet consists of one or more interconnected switches plus the 
links that connect other devices to those switches. Figure 7.20 shows a subnet 
with a single switch, but more complex subnets are required when a large 
number of devices are to be interconnected. Subnets allow administrators to 
confine broadcast and multicast transmissions within the subnet. 


Router: Connects InfiniBand subnets, or connects an Infiniband switch to a net- 
work, such as a local area network, wide area network, or storage area network. 
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The channel adapters are intelligent devices that handle all I/O functions with- 
out the need to interrupt the server’s processor. For example, there is a control pro- 
tocol by which a switch discovers all TCAs and HCAs in the fabric and assigns 
logical addresses to each. This is done without processor involvement. 

The Infiniband switch temporarily opens up channels between the processor 
and devices with which it is communicating. The devices do not have to share a 
channel’s capacity, as is the case with a bus-based design such as PCI, which requires 
that devices arbitrate for access to the processor. Additional devices are added to 
the configuration by hooking up each device’s TCA to the switch. 


INFINIBAND OPERATION Each physical link between a switch and an attached 
interface (HCA or TCA) can be support up to 16 logical channels, called virtual 
lanes. One lane is reserved for fabric management and the other lanes for data 
transport. Data are sent in the form of a stream of packets, with each packet 
containing some portion of the total data to be transferred, plus addressing and 
control information. Thus, a set of communications protocols are used to manage 
the transfer of data. A virtual lane is temporarily dedicated to the transfer of data 
from one end node to another over the InfiniBand fabric. The InfiniBand switch 
maps traffic from an incoming lane to an outgoing lane to route the data between 
the desired end points. 

Figure 7.21 indicates the logical structure used to support exchanges over 
InfiniBand. To account for the fact that some devices can send data faster than an- 
other destination device can receive it, a pair of queues at both ends of each link 
temporarily buffers excess outbound and inbound data. The queues can be located 
in the channel adapter or in the attached device’s memory. A separate pair of 
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Figure 7.21 InfiniBand Communication Protocol Stack 
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Table 7.3 InfiniBand Links and Data Throughput Rates 


Signal rate Usable capacity (80% Effective data throughput 
(unidirectional) of signal rate) (send + receive) 


2.5 Gbps 2 Gbps (250 MBps) (250 + 250) MBps 
10 Gbps 8 Gbps (1 GBps) (1 + 1) GBps 


12-wide 24 Gbps (3 GBps) (3 + 3) Gbps 


queues is used for each virtual lane. The host uses these queues in the following 
fashion. The host places a transaction, called a work queue entry (WQE) into either 
the send or receive queue of the queue pair. The two most important WQEs are 
SEND and RECEIVE. For a SEND operation, the WQE specifies a block of data in 
the device’s memory space for the hardware to send to the destination. A RECEIVE 
WOE specifies where the hardware is to place data received from another device 
when that consumer executes a SEND operation. The channel adapter processes 
each posted WQE in the proper prioritized order and generates a completion queue 
entry (CQE) to indicate the completion status. 

Figure 7.21 also indicates that a layered protocol architecture is used, consist- 
ing of four layers: 


e Physical: The physical-layer specification defines three link speeds (1X, 4X, 
and 12X) giving transmission rates of 2.5, 10,and 30 Gbps, respectively (Table 7.3). 
The physical layer also defines the physical media, including copper and opti- 
cal fiber. 


e Link: This layer defines the basic packet structure used to exchange data, 
including an addressing scheme that assigns a unique link address to every 
device in a subnet. This level includes the logic for setting up virtual lanes 
and for switching data through switches from source to destination within a 
subnet. The packet structure includes an error-detection code to provide 
reliability. 

e Network: The network layer routes packets between different InfiniBand 
subnets. 


e Transport: The transport layer provides reliability mechanism for end-to-end 
transfer of packets across one or more subnets. 


7.8 RECOMMENDED READING AND WEB SITES 


A good discussion of Intel I/O modules and architecture, including the 82C59A, 82C55A, and 
8237A, can be found in [BREY09] and [MAZI03]. 

FireWire is covered in great detail in [ANDE98]. [WICK97] and [THOMO0] provide 
concise overviews of Fire Wire. 

InfiniBand is covered in great detail in [SHANO03] and [FUTRO1]. [KAGA01] provides 
a concise overview. 
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Recommended Web sites: 


e T10 Home Page: T10 is a Technical Committee of the National Committee on Infor- 
mation Technology Standards and is responsible for lower-level interfaces. Its principal 
work is the Small Computer System Interface (SCSI). 

e 1394 Trade Association: Includes technical information and vendor pointers on 
FireWire. 

e Infiniband Trade Association: Includes technical information and vendor pointers 
on Infiniband. 

e National Facility for /O Characterization and Optimization: A facility dedi- 
cated to education and research in the area of I/O design and performance. Useful 
tools and tutorials. 


7.9 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


cycle stealing I/O channel multiplexor channel 
direct memory access (DMA) 1/O command parallel I/O 


Fire Wire 


I/O module peripheral device 


InfiniBand I/O processor programmed I/O 


interrupt 
interrupt 


isolated I/O selector channel 
-driven I/O memory-mapped I/O serial I/O 
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Review Questions 


71 
7.2 
7.3 
7.4 
7.5 
7.6 


Tel: 


List three broad classifications of external, or peripheral, devices. 

What is the International Reference Alphabet? 

What are the major functions of an I/O module? 

List and briefly define three techniques for performing I/O. 

What is the difference between memory-mapped I/O and isolated I/O? 

When a device interrupt occurs, how does the processor determine which device 
issued the interrupt? 

When a DMA module takes control of a bus, and while it retains control of the bus, 
what does the processor do? 


Problems 


71 


7.2 


7.4 


7.5 


7.6 


On a typical microprocessor, a distinct I/O address is used to refer to the I/O data reg- 
isters and a distinct address for the control and status registers in an I/O controller for 
a given device. Such registers are referred to as ports. In the Intel 8088, two I/O in- 
struction formats are used. In one format, the 8-bit opcode specifies an I/O operation; 
this is followed by an 8-bit port address. Other I/O opcodes imply that the port ad- 
dress is in the 16-bit DX register. How many ports can the 8088 address in each I/O 
addressing mode? . 


A similar instruction format is used in the Zilog Z8000 microprocessor family. In this 
case, there is a direct port addressing capability, in which a 16-bit port address is part 
of the instruction, and an indirect port addressing capability, in which the instruction 
references one of the 16-bit general purpose registers, which contains the port 
address. How many ports can the Z8000 address in each I/O addressing mode? 


The Z8000 also includes a block I/O transfer capability that, unlike DMA, is under 
the direct control of the processor. The block transfer instructions specify a port ad- 
dress register (Rp), a count register (Rc), and a destination register (Rd). Rd contains 
the main memory address at which the first byte read from the input port is to be 
stored. Rc is any of the 16-bit general purpose registers. How large a data block can 
be transferred? 


Consider a microprocessor that has a block I/O transfer instruction such as that found 
on the Z8000. Following its first execution, such an instruction takes five clock cycles 
to re-execute. However, if we employ a nonblocking I/O instruction, it takes a total of 
20 clock cycles for fetching and execution. Calculate the increase in speed with the 
block I/O instruction when transferring blocks of 128 bytes. 


A system is based on an 8-bit microprocessor and has two I/O devices. The I/O con- 

trollers for this system use separate control and status registers. Both devices handle 

data on a 1-byte-at-a-time basis. The first device has two status lines and three control 

lines. The second device has three status lines and four control lines. 

a. How many 8-bit I/O control module registers do we need for status reading and 
control of each device? 

b. What is the total number of needed control module registers given that the first 
device is an output-only device? 

c. How many distinct addresses are needed to control the two devices? 


For programmed I/O, Figure 7.5 indicates that the processor is stuck in a wait loop 

doing status checking of an I/O device. To increase efficiency, the I/O software could 

be written so that the processor periodically checks the status of the device. If the de- 

vice is not ready, the processor can jump to other tasks. After some timed interval, the 

processor comes back to check status again. 

a. Consider the above scheme for outputting data one character at a time to a 
printer that operates at 10 characters per second (cps). What will happen if its sta- 
tus is scanned every 200 ms? 


256 CHAPTER 7 / INPUT/OUTPUT 


7.7 


7.8 


7.9 


7.10 


7.11 


7.12 


7.14 


b. Next consider a keyboard with a single character buffer. On average, characters 
are entered at a rate of 10 cps. However, the time interval between two consecu- 
tive key depressions can be as short as 60 ms. At what frequency should the key- 
board be scanned by the I/O program? 


A microprocessor scans the status of an output I/O device every 20 ms. This is accom- 
plished by means of a timer alerting the processor every 20 ms. The interface of the 
device includes two ports: one for status and one for data output. How long does it 
take to scan and service the device given a clocking rate of 8 MHz? Assume for sim- 
plicity that all pertinent instruction cycles take 12 clock cycles. 


In Section 7.3, one advantage and one disadvantage of memory-mapped I/O, compared 
with isolated I/O, were listed. List two more advantages and two more disadvantages. 


A particular system is controlled by an operator through commands entered from a 

keyboard. The average number of commands entered in an 8-hour interval is 60. 

a. Suppose the processor scans the keyboard every 100 ms. How many times will the 
keyboard be checked in an 8-hour period? 

b. By what fraction would the number of processor visits to the keyboard be reduced 
if interrupt-driven I/O were used? 


Consider a system employing interrupt-driven I/O for a particular device that trans- 

fers data at an average of 8 KB/s on a continuous basis. 

a. Assume that interrupt processing takes about 100 ys (i-e., the time to jump to the 
interrupt service routine (ISR), execute it, and return to the main program). De- 
termine what fraction of processor time is consumed by this I/O device if it inter- 
rupts for every byte. 

b. Now assume that the device has two 16-byte buffers and interrupts the processor 
when one of the buffers is full. Naturally, interrupt processing takes longer, be- 
cause the ISR must transfer 16 bytes. While executing the ISR, the processor takes 
about 8 ws for the transfer of each byte. Determine what fraction of processor 
time is consumed by this I/O device in this case. 

c. Now assume that the processor is equipped with a block transfer I/O instruction 
such as that found on the Z8000. This permits the associated ISR to transfer each 
byte of a block in only 2 ws. Determine what fraction of processor time is con- 
sumed by this I/O device in this case. 


In virtually all systems that include DMA modules, DMA access to main memory is 
given higher priority than CPU access to main memory. Why? 


A DMA module is transferring characters to memory using cycle stealing, from a de- 
vice transmitting at 9600 bps. The processor is fetching instructions at the rate of 
1 million instructions per second (1 MIPS). By how much will the processor be slowed 
down due to the DMA activity? 


Consider a system in which bus cycles takes 500 ns. Transfer of bus control in either 
direction, from processor to I/O device or vice versa, takes 250 ns. One of the I/O de- 
vices has a data transfer rate of 50 KB/s and employs DMA. Data are transferred one 
byte at a time. 

a. Suppose we employ DMA in a burst mode. That is, the DMA interface gains bus 
mastership prior to the start of a block transfer and maintains control of the bus 
until the whole block is transferred. For how long would the device tie up the bus 
when transferring a block of 128 bytes? 

b. Repeat the calculation for cycle-stealing mode. 


Examination of the timing diagram of the 8237A indicates that once a block transfer 

begins, it takes three bus clock cycles per DMA cycle. During the DMA cycle, the 

8237A transfers one byte of information between memory and I/O device. 

a. Suppose we clock the 8237A at a rate of 5 MHz. How long does it take to transfer 
one byte? 

b. What would be the maximum attainable data transfer rate? 

c. Assume that the memory is not fast enough and we have to insert two wait states 
per DMA cycle. What will be the actual data transfer rate? 
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Assume that in the system of the preceding problem, a memory cycle takes 750 ns. To 
what value could we reduce the clocking rate of the bus without effect on the attain- 
able data transfer rate? 

A DMA controller serves four receive-only telecommunication links (one per DMA 
channel) having a speed of 64 Kbps each. 

a. Would you operate the controller in burst mode or in cycle-stealing mode? 

b. What priority scheme would you employ for service of the DMA channels? 

A 32-bit computer has two selector channels and one multiplexor channel. Each se- 
lector channel supports two magnetic disk and two magnetic tape units. The multi- 
plexor channel has two line printers, two card readers, and 10 VDT terminals 
connected to it. Assume the following transfer rates: 


Disk drive 800 KBytes/s 
Magnetic tape drive 200 KBytes/s 
Line printer 6.6 KBytes/s 
Card reader 1.2 KBytes/s 
VDT 1 KBytes/s 


Estimate the maximum aggregate I/O transfer rate in this system. 


A computer consists of a processor and an I/O device D connected to main memory 

M via a shared bus with a data bus width of one word. The processor can execute a 

maximum of 10° instructions per second. An average instruction requires five ma- 

chine cycles, three of which use the memory bus. A memory read or write operation 
uses one machine cycle. Suppose that the processor is continuously executing 

“background” programs that require 95% of its instruction execution rate but not 

any I/O instructions. Assume that one processor cycle equals one bus cycle. Now 

suppose the I/O device is to be used to transfer very large blocks of data between M 

and D. 

a. Ifprogrammed I/O is used and each one-word I/O transfer requires the processor 
to execute two instructions, estimate the maximum I/O data-transfer rate, in 
words per second, possible through D. 

b. Estimate the same rate if DMA is used. 


A data source produces 7-bit IRA characters, to each of which is appended a parity 

bit. Derive an expression for the maximum effective data rate (rate of IRA data bits) 

over an R-bps line for the following: 

a. Asynchronous transmission, with a 1.5-unit stop bit 

b. Bit-synchronous transmission, with a frame consisting of 48 control bits and 128 
information bits 

c. Same as (b), with a 1024-bit information field 

d. Character-synchronous, with 9 control characters per frame and 16 information 
characters 

e. Same as (d), with 128 information characters 


The following problem is based on a suggested illustration of I/O mechanisms in 

[ECKE90] (Figure 7.22): 

Two women are on either side of a high fence. One of the women, named 
Apple-server, has a beautiful apple tree loaded with delicious apples growing on her 
side of the fence; she is happy to supply apples to the other woman whenever needed. 
The other woman, named Apple-eater, loves to eat apples but has none. In fact, she 
must eat her apples at a fixed rate (an apple a day keeps the doctor away). If she eats 
them faster than that rate, she will get sick. If she eats them slower, she will suffer mal- 
nutrition. Neither woman can talk, and so the problem is to get apples from Apple- 
server to Apple-eater at the correct rate. 

a. Assume that there is an alarm clock sitting on top of the fence and that the clock 
can have multiple alarm settings. How can the clock be used to solve the problem? 
Draw a timing diagram to illustrate the solution. 

b. Now assume that there is no alarm clock. Instead Apple-eater has a flag that she 
can wave whenever she needs an apple. Suggest a new solution. Would it be 
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Figure 7.22 An Apple Problem 


helpful for Apple-server also to have a flag? If so, incorporate this into the solu- 
tion. Discuss the drawbacks of this approach. 

c. Now take away the flag and assume the existence of a long piece of string. Suggest 
a solution that is superior to that of (b) using the string. 

7.21 Assume that one 16-bit and two 8-bit microprocessors are to be interfaced to a system 

bus. The following details are given: 

1. All microprocessors have the hardware features necessary for any type of data 
transfer: programmed I/O, interrupt-driven I/O, and DMA. 

2. All microprocessors have a 16-bit address bus. 

3. Two memory boards, each of 64 KBytes capacity, are interfaced with the bus. The 
designer wishes to use a shared memory that is as large as possible. 

4. The system bus supports a maximum of four interrupt lines and one DMA line. 
Make any other assumptions necessary, and 

a. Give the system bus specifications in terms of number and types of lines. 

b. Describe a possible protocol for communicating on the bus (i.e., read-write, inter- 
rupt, and DMA sequences). 

c. Explain how the aforementioned devices are interfaced to the system bus. 
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KEY POINTS 


© The operating system (OS) is the software that controls the execution of 
programs on a processor and that manages the processor’s resources. A 
number of the functions performed by the OS, including process scheduling 
and memory management, can only be performed efficiently and rapidly if 
the processor hardware includes capabilities to support the OS. Virtually all 
processors include such capabilities to a greater or lesser extent, including 
virtual memory management hardware and process management hard- 
ware. The hardware includes special purpose registers and buffers, as well 
as circuitry to perform basic resource management tasks. 


® One of the most important functions of the OS is the scheduling of 
processes, or tasks. The OS determines which process should run at any 
given time. Typically, the hardware will interrupt a running process from 
time to time to enable the OS to make a new scheduling decision so as to 
share processor time fairly among a number of processes. 


® Another important OS function is memory management. Most contempo- 
rary operating systems include a virtual memory capability, which has two 
benefits: (1) A process can run in main memory without all of the instruc- 
tions and data for that program being present in main memory at one time, 
and (2) the total memory space available to a program may far exceed the 
actual main memory on the system. Although memory management is per- 
formed in software, the OS relies on hardware support in the processor, 
including paging and segmentation hardware. 


Although the focus of this text is computer hardware, there is one area of software that 
needs to be addressed: the computer’s OS. The OS is a program that manages the com- 
puter’s resources, provides services for programmers, and schedules the execution of 
other programs. Some understanding of operating systems is essential to appreciate the 
mechanisms by which the CPU controls the computer system. In particular, explana- 
tions of the effect of interrupts and of the management of the memory hierarchy are 
best explained in this context. 

The chapter begins with an overview and brief history of operating systems. The 
bulk of the chapter looks at the two OS functions that are most relevant to the study of 
computer organization and architecture: scheduling and memory management. 


8.1 OPERATING SYSTEM OVERVIEW 


Operating System Objectives and Functions 


An OS is a program that controls the execution of application programs and acts as 
an interface between the user of a computer and the computer hardware. It can be 
thought of as having two objectives: 


8.1 / OPERATING SYSTEM OVERVIEW 261 


e Convenience: An OS makes a computer more convenient to use. 


e Efficiency: An OS allows the computer system resources to be used in an effi- 
cient manner. 


Let us examine these two aspects of an OS in turn. 


THE OPERATING SYSTEM AS A USER/COMPUTER INTERFACE The hardware 
and software used in providing applications to a user can be viewed in a layered or 
hierarchical fashion, as depicted in Figure 8.1. The user of those applications, the 
end user, generally is not concerned with the computer’s architecture. Thus the 
end user views a computer system in terms of an application. That application can 
be expressed in a programming language and is developed by an application pro- 
grammer. To develop an application program as a set of processor instructions 
that is completely responsible for controlling the computer hardware would be an 
overwhelmingly complex task. To ease this task, a set of systems programs is pro- 
vided. Some of these programs are referred to as utilities. These implement fre- 
quently used functions that assist in program creation, the management of files, 
and the control of I/O devices. A programmer makes use of these facilities in de- 
veloping an application, and the application, while it is running, invokes the utili- 
ties to perform certain functions. The most important system program is the OS. 
The OS masks the details of the hardware from the programmer and provides the 
programmer with a convenient interface for using the system. It acts as mediator, 
making it easier for the programmer and for application programs to access and 
use those facilities and services. 


Programmer 


Operating 
system 
designer 


Application programs 


Utilities 


Operating system 


Computer hardware 


Figure 8.1 Layers and Views of a Computer System 
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Briefly, the OS typically provides services in the following areas: 


e Program creation: The OS provides a variety of facilities and services, such as 
editors and debuggers, to assist the programmer in creating programs. Typi- 
cally, these services are in the form of utility programs that are not actually 
part of the OS but are accessible through the OS. 


e Program execution: A number of tasks need to be performed to execute a pro- 
gram. Instructions and data must be loaded into main memory, I/O devices 
and files must be initialized, and other resources must be prepared. The OS 
handles all of this for the user. 


e Access to I/O devices: Each I/O device requires its own specific set of instruc- 
tions or control signals for operation. The OS takes care of the details so that 
the programmer can think in terms of simple reads and writes. 


e Controlled access to files: In the case of files, control must include an under- 
standing of not only the nature of the I/O device (disk drive, tape drive) but 
also the file format on the storage medium. Again, the OS worries about the 
details. Further, in the case of a system with multiple simultaneous users, the 
OS can provide protection mechanisms to control access to the files. 


e System access: In the case of a shared or public system, the OS controls access 
to the system as a whole and to specific system resources. The access function 
must provide protection of resources and data from unauthorized users and 
must resolve conflicts for resource contention. 


e Error detection and response: A variety of errors can occur while a computer 
system is running. These include internal and external hardware errors, such as 
a memory error, or a device failure or malfunction; and various software er- 
rors, such as arithmetic overflow, attempt to access forbidden memory loca- 
tion, and inability of the OS to grant the request of an application. In each 
case, the OS must make the response that clears the error condition with the 
least impact on running applications. The response may range from ending the 
program that caused the error, to retrying the operation, to simply reporting 
the error to the application. 


e Accounting: A good OS collects usage statistics for various resources and 
monitor performance parameters such as response time. On any system, this 
information is useful in anticipating the need for future enhancements and in 
tuning the system to improve performance. On a multiuser system, the infor- 
mation can be used for billing purposes. 


THE OPERATING SYSTEM AS RESOURCE MANAGER A computer is a set of 
resources for the movement, storage, and processing of data and for the control of 
these functions. The OS is responsible for managing these resources. 

Can we say that the OS controls the movement, storage, and processing of 
data? From one point of view, the answer is yes: By managing the computer’s 
resources, the OS is in control of the computer’s basic functions. But this control is 
exercised in a curious way. Normally, we think of a control mechanism as something 
external to that which is controlled, or at least as something that is a distinct and 
separate part of that which is controlled. (For example, a residential heating system 
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is controlled by a thermostat, which is completely distinct from the heat-generation 
and heat-distribution apparatus.) This is not the case with the OS, which as a control 
mechanism is unusual in two respects: 


e The OS functions in the same way as ordinary computer software; that is, it is a 
program executed by the processor. 


e The OS frequently relinquishes control and must depend on the processor to 
allow it to regain control. 


The OS is, in fact, nothing more than a computer program. Like other com- 
puter programs, it provides instructions for the processor. The key difference is in 
the intent of the program. The OS directs the processor in the use of the other system 
resources and in the timing of its execution of other programs. But in order for the 
processor to do any of these things, it must cease executing the OS program and ex- 
ecute other programs. Thus, the OS relinquishes control for the processor to do 
some “useful” work and then resumes control long enough to prepare the processor 
to do the next piece of work. The mechanisms involved in all this should become 
clear as the chapter proceeds. 

Figure 8.2 suggests the main resources that are managed by the OS. A portion of 
the OS is in main memory. This includes the kernel, or nucleus, which contains the 
most frequently used functions in the OS and, at a given time, other portions of the OS 
currently in use. The remainder of main memory contains user programs and data. The 
allocation of this resource (main memory) is controlled jointly by the OS and memory- 
management hardware in the processor, as we shall see. The OS decides when an I/O 
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Figure 8.2 The Operating System as Resource Manager 
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device can be used by a program in execution, and controls access to and use of files. 
The processor itself is a resource, and the OS must determine how much processor 
time is to be devoted to the execution of a particular user program. In the case of a 
multiple-processor system, this decision must span all of the processors. 


Types of Operating Systems 


Certain key characteristics serve to differentiate various types of operating systems. 
The characteristics fall along two independent dimensions. The first dimension spec- 
ifies whether the system is batch or interactive. In an interactive system, the user/ 
programmer interacts directly with the computer, usually through a keyboard/display 
terminal, to request the execution of a job or to perform a transaction. Furthermore, 
the user may, depending on the nature of the application, communicate with the com- 
puter during the execution of the job. A batch system is the opposite of interactive. 
The user’s program is batched together with programs from other users and submit- 
ted by a computer operator. After the program is completed, results are printed out 
for the user. Pure batch systems are rare today. However, it will be useful to the 
description of contemporary operating systems to examine batch systems briefly. 

An independent dimension specifies whether the system employs multiprogram- 
ming or not. With multiprogramming, the attempt is made to keep the processor as 
busy as possible, by having it work on more than one program at a time. Several pro- 
grams are loaded into memory, and the processor switches rapidly among them. The 
alternative is a uniprogramming system that works only one program at a time. 


EARLY SYSTEMS With the earliest computers, from the late 1940s to the mid-1950s, 
the programmer interacted directly with the computer hardware; there was no OS. 
These processors were run from a console, consisting of display lights, toggle switches, 
some form of input device, and a printer. Programs in processor code were loaded via 
the input device (e.g., a card reader). If an error halted the program, the error condi- 
tion was indicated by the lights. The programmer could proceed to examine registers 
and main memory to determine the cause of the error. If the program proceeded to 
a normal completion, the output appeared on the printer. 

These early systems presented two main problems.: 


e Scheduling: Most installations used a sign-up sheet to reserve processor time. 
Typically, a user could sign up for a block of time in multiples of a half hour or 
so. A user might sign up for an hour and finish in 45 minutes; this would result 
in wasted computer idle time. On the other hand, the user might run into prob- 
lems, not finish in the allotted time, and be forced to stop before resolving the 
problem. 


e Setup time: A single program, called a job, could involve loading the compiler 
plus the high-level language program (source program) into memory, saving 
the compiled program (object program), and then loading and linking together 
the object program and common functions. Each of these steps could involve 
mounting or dismounting tapes, or setting up card decks. If an error occurred, 
the hapless user typically had to go back to the beginning of the setup se- 
quence. Thus a considerable amount of time was spent just in setting up the 
program to run. 
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This mode of operation could be termed serial processing, reflecting the fact 
that users have access to the computer in series. Over time, various system software 
tools were developed to attempt to make serial processing more efficient. These in- 
clude libraries of common functions, linkers, loaders, debuggers, and I/O driver rou- 
tines that were available as common software for all users. 


SIMPLE BATCH SYSTEMS Early processors were very expensive, and therefore it 
was important to maximize processor utilization. The wasted time due to scheduling 
and setup time was unacceptable. 

To improve utilization, simple batch operating systems were developed. With 
such a system, also called a monitor, the user no longer has direct access to the 
processor. Rather, the user submits the job on cards or tape to a computer operator, 
who batches the jobs together sequentially and places the entire batch on an input 
device, for use by the monitor. 

To understand how this scheme works, let us look at it from two points of 
view: that of the monitor and that of the processor. From the point of view of the 
monitor, the monitor controls the sequence of events. For this to be so, much of the 
monitor must always be in main memory and available for execution (Figure 8.3). 
That portion is referred to as the resident monitor. The rest of the monitor consists 
of utilities and common functions that are loaded as subroutines to the user pro- 
gram at the beginning of any job that requires them. The monitor reads in jobs one 
at a time from the input device (typically a card reader or magnetic tape drive). As 
it is read in, the current job is placed in the user program area, and control is passed 
to this job. When the job is completed, it returns control to the monitor, which im- 
mediately reads in the next job. The results of each job are printed out for delivery 
to the user. 
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Now consider this sequence from the point of view of the processor. At a cer- 
tain point in time, the processor is executing instructions from the portion of main 
memory containing the monitor. These instructions cause the next job to be read in 
to another portion of main memory. Once a job has been read in, the processor will 
encounter in the monitor a branch instruction that instructs the processor to con- 
tinue execution at the start of the user program. The processor will then execute the 
instruction in the user’s program until it encounters an ending or error condition. 
Either event causes the processor to fetch its next instruction from the monitor pro- 
gram. Thus the phrase “control is passed to a job” simply means that the processor is 
now fetching and executing instructions in a user program, and “control is returned 
to the monitor” means that the processor is now fetching and executing instructions 
from the monitor program. 

It should be clear that the monitor handles the scheduling problem. A batch of 
jobs is queued up, and jobs are executed as rapidly as possible, with no intervening 
idle time. 

How about the job setup time? The monitor handles this as well. With each 
job, instructions are included in a job control language (JCL). This is a special 
type of programming language used to provide instructions to the monitor. A 
simple example is that of a user submitting a program written in FORTRAN 
plus some data to be used by the program. Each FORTRAN instruction and 
each item of data is on a separate punched card or a separate record on tape. 
In addition to FORTRAN and data lines, the job includes job control instruc- 
tions, which are denoted by the beginning “$”. The overall format of the job 
looks like this: 


$JOB 
$FTN 
} FORTRAN instructions 
$LOAD 
$RUN 
} Data 
$END 


To execute this job, the monitor reads the $FTN line and loads the appropriate 
compiler from its mass storage (usually tape). The compiler translates the user’s 
program into object code, which is stored in memory or mass storage. If it is stored 
in memory, the operation is referred to as “compile, load, and go.” If it is stored on 
tape, then the $LOAD instruction is required. This instruction is read by the moni- 
tor, which regains control after the compile operation. The monitor invokes the 
loader, which loads the object program into memory in place of the compiler and 
transfers control to it. In this manner, a large segment of main memory can be 
shared among different subsystems, although only one such subsystem could be res- 
ident and executing at a time. 

We see that the monitor, or batch OS, is simply a computer program. It relies 
on the ability of the processor to fetch instructions from various portions of main 
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Read one record from file 15 ps 
Execute 100 instructions 1 ps 
Write one record to file 15 ws 
TOTAL 31 ps 


Percent CPU utilization = - = 0.032 = 3.2% 


Figure 8.4 System Utilization Example 


memory in order to seize and relinquish control alternately. Certain other hardware 
features are also desirable: 


e Memory protection: While the user program is executing, it must not alter the 
memory area containing the monitor. If such an attempt is made, the processor 
hardware should detect an error and transfer control to the monitor. The mon- 
itor would then abort the job, print out an error message, and load the next job. 


e Timer: A timer is used to prevent a single job from monopolizing the system. 
The timer is set at the beginning of each job. If the timer expires, an interrupt 
occurs, and control returns to the monitor. 


e Privileged instructions: Certain instructions are designated privileged and can 
be executed only by the monitor. If the processor encounters such an instruction 
while executing a user program, an error interrupt occurs. Among the privileged 
instructions are I/O instructions, so that the monitor retains control of all I/O de- 
vices. This prevents, for example, a user program from accidentally reading job 
control instructions from the next job. If a user program wishes to perform I/O, it 
must request that the monitor perform the operation for it. If a privileged in- 
struction is encountered by the processor while it is executing a user program, the 
processor hardware considers this an error and transfers control to the monitor. 


e Interrupts: Early computer models did not have this capability. This feature 
gives the OS more flexibility in relinquishing control to and regaining control 
from user programs. 


Processor time alternates between execution of user programs and execution 
of the monitor. There have been two sacrifices: Some main memory is now given 
over to the monitor and some processor time is consumed by the monitor. Both of 
these are forms of overhead. Even with this overhead, the simple batch system im- 
proves utilization of the computer. 


MULTIPROGRAMMED BATCH SYSTEMS Even with the automatic job sequencing 
provided by a simple batch OS, the processor is often idle. The problem is that I/O 
devices are slow compared to the processor. Figure 8.4 details a representative cal- 
culation. The calculation concerns a program that processes a file of records and 
performs, on average, 100 processor instructions per record. In this example the 
computer spends over 96% of its time waiting for I/O devices to finish transferring 
data! Figure 8.5a illustrates this situation. The processor spends a certain amount of 
time executing, until it reaches an I/O instruction. It must then wait until that I/O 
instruction concludes before proceeding. 
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(c) Multiprogramming with three programs 


Figure 8.5 Multiprogramming Example 


This inefficiency is not necessary. We know that there must be enough mem- 
ory to hold the OS (resident monitor) and one user program. Suppose that there 
is room for the OS and two user programs. Now, when one job needs to wait for 
I/O, the processor can switch to the other job, which likely is not waiting for I/O 
(Figure 8.5b). Furthermore, we might expand memory to hold three, four, or 
more programs and switch among all of them (Figure 8.5c). This technique is 
known as multiprogramming, or multitasking.! It is the central theme of modern 
operating systems. 


The term multitasking is sometimes reserved to mean multiple tasks within the same program that may 
be handled concurrently by the OS, in contrast to multiprogramming, which would refer to multiple 
processes from multiple programs. However, it is more common to equate the terms multitasking and 
multiprogramming, as is done in most standards dictionaries (e.g., IEEE Std 100-1992, The New IEEE 
Standard Dictionary of Electrical and Electronics Terms). 
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Table 8.1 Sample Program Execution Attributes 


JOB1 JOB2 JOB3 


Type of job Heavy compute Heavy I/O Heavy I/O 
Duration 5 min 15 min 10 min 
Memory required 50M 100 M 80 M 
Need disk? 

Need terminal? 


Need printer? 


Example 8.1 This example illustrates the benefit of multiprogramming. Consider a com- 
puter with 250 MBytes of available memory (not used by the OS), a disk, a terminal, and 
a printer. Three programs, JOB1, JOB2, and JOB3, are submitted for execution at the 
same time, with the attributes listed in Table 8.1. We assume minimal processor require- 


ments for JOB2 and JOB3 and continuous disk and printer use by JOB3. For a simple 
batch environment, these jobs will be executed in sequence. Thus, JOB1 completes in 
5 minutes. JOB2 must wait until the 5 minutes is over and then completes 15 minutes after 
that. JOB3 begins after 20 minutes and completes at 30 minutes from the time it was ini- 
tially submitted. The average resource utilization, throughput, and response times are 
shown in the uniprogramming column of Table 8.2. Device-by-device utilization is illus- 
trated in Figure 8.6a. It is evident that there is gross underutilization for all resources 
when averaged over the required 30-minute time period. 

Now suppose that the jobs are run concurrently under a multiprogramming OS. Be- 
cause there is little resource contention between the jobs, all three can run in nearly mini- 
mum time while coexisting with the others in the computer (assuming that JOB2 and 
JOB3 are allotted enough processor time to keep their input and output operations ac- 
tive). JOB1 will still require 5 minutes to complete but at the end of that time, JOB2 will 
be one-third finished, and JOB3 will be half finished. All three jobs will have finished 
within 15 minutes. The improvement is evident when examining the multiprogramming 
column of Table 8.2, obtained from the histogram shown in Figure 8.6b. 


As with a simple batch system, a multiprogramming batch system must rely on 
certain computer hardware features. The most notable additional feature that is use- 
ful for multiprogramming is the hardware that supports I/O interrupts and DMA. 


Table 8.2 Effects of Multiprogramming on Resource Utilization 


Uniprogramming Multiprogramming 


Processor use 20% 40% 
Memory use 33% 67% 
Disk use 33% 67% 
Printer use 33% 67% 


Elapsed time 30 min 15 min 
Throughput rate 6 jobs/hr 12 jobs/hr 


Mean response time 18 min 10 min 
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Figure 8.6 Utilization Histograms 


With interrupt-driven I/O or DMA, the processor can issue an I/O command for 
one job and proceed with the execution of another job while the I/O is carried out 
by the device controller. When the I/O operation is complete, the processor is inter- 
rupted and control is passed to an interrupt-handling program in the OS. The OS 
will then pass control to another job. 

Multiprogramming operating systems are fairly sophisticated compared to single- 
program, or uniprogramming, systems. To have several jobs ready to run, the jobs must 
be kept in main memory, requiring some form of memory management. In addition, if 
several jobs are ready to run, the processor must decide which one to run, which re- 
quires some algorithm for scheduling. These concepts are discussed later in this chapter. 


TIME-SHARING SYSTEMS With the use of multiprogramming, batch processing can 
be quite efficient. However, for many jobs, it is desirable to provide a mode in which 
the user interacts directly with the computer. Indeed, for some jobs, such as transac- 
tion processing, an interactive mode is essential. 

Today, the requirement for an interactive computing facility can be, and often is, 
met by the use of a dedicated microcomputer. That option was not available in the 
1960s, when most computers were big and costly. Instead, time sharing was developed. 

Just as multiprogramming allows the processor to handle multiple batch jobs 
at a time, multiprogramming can be used to handle multiple interactive jobs. In this 
latter case, the technique is referred to as time sharing, because the processor’s time 
is shared among multiple users. In a time-sharing system, multiple users simultaneously 
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Table 8.3. Batch Multiprogramming versus Time Sharing 


Batch Multiprogramming Time Sharing 


Principal objective Maximize processor use Minimize response time 


Source of directives to Job control language Commands entered at the 
operating system commands provided with the job terminal 


access the system through terminals, with the OS interleaving the execution of each 
user program in a short burst or quantum of computation. Thus, if there are n users 
actively requesting service at one time, each user will only see on the average 1/n of 
the effective computer speed, not counting OS overhead. However, given the rela- 
tively slow human reaction time, the response time on a properly designed system 
should be comparable to that on a dedicated computer. 

Both batch multiprogramming and time sharing use multiprogramming. The 
key differences are listed in Table 8.3. 


8.2 SCHEDULING 


The key to multiprogramming is scheduling. In fact, four types of scheduling are typ- 
ically involved (Table 8.4). We will explore these presently. But first, we introduce 
the concept of process. This term was first used by the designers of the Multics OS in 
the 1960s. It is a somewhat more general term than job. Many definitions have been 
given for the term process, including 


e A program in execution 
e The “animated spirit” of a program 
e That entity to which a processor is assigned 


This concept should become clearer as we proceed. 


Long-Term Scheduling 


The long-term scheduler determines which programs are admitted to the system for 
processing. Thus, it controls the degree of multiprogramming (number of processes 
in memory). Once admitted, a job or user program becomes a process and is added 
to the queue for the short-term scheduler. In some systems, a newly created process 


Table 8.4 Types of Scheduling 


Long-term scheduling The decision to add to the pool of processes to be executed 


Medium-term scheduling The decision to add to the number of processes that are 
partially or fully in main memory 


Short-term scheduling The decision as to which available process will be executed 
by the processor 


I/O scheduling The decision as to which process’s pending I/O request shall 
be handled by an available I/O device 
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begins in a swapped-out condition, in which case it is added to a queue for the 
medium-term scheduler. 

In a batch system, or for the batch portion of a general-purpose OS, newly sub- 
mitted jobs are routed to disk and held in a batch queue. The long-term scheduler cre- 
ates processes from the queue when it can. There are two decisions involved here. First, 
the scheduler must decide that the OS can take on one or more additional processes. 
Second, the scheduler must decide which job or jobs to accept and turn into processes. 
The criteria used may include priority, expected execution time, and I/O requirements. 

For interactive programs in a time-sharing system, a process request is gener- 
ated when a user attempts to connect to the system. Time-sharing users are not sim- 
ply queued up and kept waiting until the system can accept them. Rather, the OS 
will accept all authorized comers until the system is saturated, using some prede- 
fined measure of saturation. At that point, a connection request is met with a mes- 
sage indicating that the system is full and the user should try again later. 


Medium-Term Scheduling 


Medium-term scheduling is part of the swapping function, described in Section 8.3. 
Typically, the swapping-in decision is based on the need to manage the degree of 
multiprogramming. On a system that does not use virtual memory, memory man- 
agement is also an issue. Thus, the swapping-in decision will consider the memory 
requirements of the swapped-out processes. 


Short-Term Scheduling 


The long-term scheduler executes relatively infrequently and makes the coarse- 
grained decision of whether or not to take on a new process, and which one to take. 
The short-term scheduler, also known as the dispatcher, executes frequently and 
makes the fine-grained decision of which job to execute next. 


PROCESS STATES To understand the operation of the short-term scheduler, we need 
to consider the concept of a process state. During the lifetime of a process, its status 
will change a number of times. Its status at any point in time is referred to as a State. 
The term state is used because it connotes that certain information exists that defines 
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the status at that point. At minimum, there are five defined states for a process 
(Figure 8.7): 
e New: A program is admitted by the high-level scheduler but is not yet ready to 
execute. The OS will initialize the process, moving it to the ready state. 
e Ready: The process is ready to execute and is awaiting access to the processor. 
e Running: The process is being executed by the processor. 


e Waiting: The process is suspended from execution waiting for some system 
resource, such as I/O. 


e Halted: The process has terminated and will be destroyed by the OS. 

For each process in the system, the OS must maintain information indicating 
the state of the process and other information necessary for process execution. For 
this purpose, each process is represented in the OS by a process control block 
(Figure 8.8), which typically contains 

e Identifier: Each current process has a unique identifier. 
e State: The current state of the process (new, ready, and so on). 
e Priority: Relative priority level. 


e Program counter: The address of the next instruction in the program to be 
executed. 


e Memory pointers: The starting and ending locations of the process in memory. 


e Context data: These are data that are present in registers in the processor while 
the process is executing, and they will be discussed in Part Three. For now, it is 
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Figure 8.8 Process Control Block 
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enough to say that these data represent the “context” of the process. The context 
data plus the program counter are saved when the process leaves the running state. 
They are retrieved by the processor when it resumes execution of the process. 


I/O status information: Includes outstanding I/O requests, I/O devices (e.g., tape 
drives) assigned to this process, a list of files assigned to the process, and so on. 


> Accounting information: May include the amount of processor time and clock 
time used, time limits, account numbers, and so on. 


When the scheduler accepts a new job or user request for execution, it creates 
a blank process control block and places the associated process in the new state. 
After the system has properly filled in the process control block, the process is 
transferred to the ready state. 


CH iCH S To understand how the OS manages the scheduling of the 
various isiobeut in memory, ict us begin by considering the simple example in Figure 8.9. 
The figure shows how main memory is partitioned at a given point in time. The kernel 
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of the OS is, of course, always resident. In addition, there are a number of active 
processes, including A and B, each of which is allocated a portion of memory. 

We begin at a point in time when process A is running. The processor is exe- 
cuting instructions from the program contained in A’s memory partition. At some 
later point in time, the processor ceases to execute instructions in A and begins exe- 
cuting instructions in the OS area. This will happen for one of three reasons: 


1. Process A issues a service call (e.g., an I/O request) to the OS. Execution of A 
is suspended until this call is satisfied by the OS. 


2. Process A causes an interrupt. An interrupt is a hardware-generated signal to 
the processor. When this signal is detected, the processor ceases to execute A 
and transfers to the interrupt handler in the OS. A variety of events related to 
A will cause an interrupt. One example is an error, such as attempting to exe- 
cute a privileged instruction. Another example is a timeout; to prevent any one 
process from monopolizing the processor, each process is only granted the 
processor for a short period at a time. 


3. Some event unrelated to process A that requires attention causes an interrupt. 
An example is the completion of an I/O operation. 


In any case, the result is the following. The processor saves the current context 
data and the program counter for A in A’s process control block and then begins ex- 
ecuting in the OS. The OS may perform some work, such as initiating an I/O opera- 
tion. Then the short-term-scheduler portion of the OS decides which process should 
be executed next. In this example, B is chosen. The OS instructs the processor to 
restore B’s context data and proceed with the execution of B where it left off. 

This simple example highlights the basic functioning of the short-term scheduler. 
Figure 8.10 shows the major elements of the OS involved in the multiprogramming 
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and scheduling of processes. The OS receives control of the processor at the inter- 
rupt handler if an interrupt occurs and at the service-call handler if a service call oc- 
curs. Once the interrupt or service call is handled, the short-term scheduler is 
invoked to select a process for execution. 

To do its job, the OS maintains a number of queues. Each queue is simply a 
waiting list of processes waiting for some resource. The long-term queue is a list of 
jobs waiting to use the system. As conditions permit, the high-level scheduler will al- 
locate memory and create a process for one of the waiting items. The short-term 
queue consists of all processes in the ready state. Any one of these processes could 
use the processor next. It is up to the short-term scheduler to pick one. Generally, 
this is done with a round-robin algorithm, giving each process some time in turn. Pri- 
ority levels may also be used. Finally, there is an I/O queue for each I/O device. 
More than one process may request the use of the same I/O device. All processes 
waiting to use each device are lined up in that device’s queue. 

Figure 8.11 suggests how processes progress through the computer under the 
control of the OS. Each process request (batch job, user-defined interactive job) is 
placed in the long-term queue. As resources become available, a process request be- 
comes a process and is then placed in the ready state and put in the short-term 
queue. The processor alternates between executing OS instructions and executing 
user processes. While the OS is in control, it decides which process in the short-term 
queue should be executed next. When the OS has finished its immediate tasks, it 
turns the processor over to the chosen process. 

As was mentioned earlier, a process being executed may be suspended for a 
variety of reasons. If it is suspended because the process requests I/O, then it is 
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Figure 8.11 Queuing Diagram Representation of Processor Scheduling 
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placed in the appropriate I/O queue. If it is suspended because of a timeout or be- 
cause the OS must attend to pressing business, then it is placed in the ready state 
and put into the short-term queue. 

Finally, we mention that the OS also manages the I/O queues. When an I/O op- 
eration is completed, the OS removes the satisfied process from that I/O queue and 
places it in the short-term queue. It then selects another waiting process (if any) and 
signals for the I/O device to satisfy that process’s request. 


8.3 MEMORY MANAGEMENT 


In a uniprogramming system, main memory is divided into two parts: one part for 
the OS (resident monitor) and one part for the program currently being executed. 
In a multiprogramming system, the “user” part of memory is subdivided to accom- 
modate multiple processes. The task of subdivision is carried out dynamically by the 
OS and is known as memory management. 

Effective memory management is vital in a multiprogramming system. If only 
a few processes are in memory, then for much of the time all of the processes will be 
waiting for I/O and the processor will be idle. Thus, memory needs to be allocated 
efficiently to pack as many processes into memory as possible. 


Swapping 


Referring back to Figure 8.11, we have discussed three types of queues: the long- 
term queue of requests for new processes, the short-term queue of processes ready 
to use the processor, and the various I/O queues of processes that are not ready to 
use the processor. Recall that the reason for this elaborate machinery is that I/O ac- 
tivities are much slower than computation and therefore the processor in a unipro- 
gramming system is idle most of the time. 

But the arrangement in Figure 8.11 does not entirely solve the problem. It is 
true that, in this case, memory holds multiple processes and that the processor can 
move to another process when one process is waiting. But the processor is so 
much faster than I/O that it will be common for all the processes in memory to be 
waiting on I/O. Thus, even with multiprogramming, a processor could be idle most 
of the time. 

What to do? Main memory could be expanded, and so be able to accommo- 
date more processes. But there are two flaws in this approach. First, main memory is 
expensive, even today. Second, the appetite of programs for memory has grown as 
fast as the cost of memory has dropped. So larger memory results in larger processes, 
not more processes. 

Another solution is swapping, depicted in Figure 8.12. We have a long-term 
queue of process requests, typically stored on disk. These are brought in, one at a 
time, as space becomes available. As processes are completed, they are moved out of 
main memory. Now the situation will arise that none of the processes in memory are 
in the ready state (e.g., all are waiting on an I/O operation). Rather than remain idle, 
the processor swaps one of these processes back out to disk into an intermediate 
queue. This is a queue of existing processes that have been temporarily kicked out of 
memory. The OS then brings in another process from the intermediate queue, or it 
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Figure 8.12 The Use of Swapping 


honors a new process request from the long-term queue. Execution then continues 
with the newly arrived process. 

Swapping, however, is an I/O operation, and therefore there is the potential 
for making the problem worse, not better. But because disk I/O is generally the 
fastest I/O on a system (e.g., compared with tape or printer I/O), swapping will usu- 
ally enhance performance. A more sophisticated scheme, involving virtual memory, 
improves performance over simple swapping. This will be discussed shortly. But 
first, we must prepare the ground by explaining partitioning and paging. 


Partitioning 


The simplest scheme for partitioning available memory is to use fixed-size partitions, 
as shown in Figure 8.13. Note that, although the partitions are of fixed size, they 
need not be of equal size. When a process is brought into memory, it is placed in the 
smallest available partition that will hold it. 

Even with the use of unequal fixed-size partitions, there will be wasted memory. 
In most cases, a process will not require exactly as much memory as provided by the 
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Figure 8.13. Example of Fixed Partitioning of a 64-Mbyte Memory 


partition. For example, a process that requires 3M bytes of memory would be placed 
in the 4M partition of Figure 8.13b, wasting 1M that could be used by another process. 

A more efficient approach is to use variable-size partitions. When a process is 
brought into memory, it is allocated exactly as much memory as it requires and no more. 


Example 8.2. An example, using 64 MBytes of main memory, is shown in Figure 8.14. Ini- 
tially, main memory is empty, except for the OS (a). The first three processes are loaded 
in, starting where the OS ends and occupying just enough space for each process (b,c, d). 
This leaves a “hole” at the end of memory that is too small for a fourth process. At some 
point, none of the processes in memory is ready. The OS swaps out process 2 (e), which 


leaves sufficient room to load a new process, process 4 (f). Because process 4 is smaller 
than process 2, another small hole is created. Later, a point is reached at which none of the 
processes in main memory is ready, but process 2, in the Ready-Suspend state, is available. 
Because there is insufficient room in memory for process 2, the OS swaps process 1 
out (g) and swaps process 2 back in (h). 
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Figure 8.14 The Effect of Dynamic Partitioning 


As this example shows, this method starts out well, but eventually it leads to a 
situation in which there are a lot of small holes in memory. As time goes on, memory 
becomes more and more fragmented, and memory utilization declines. One tech- 
nique for overcoming this problem is compaction: From time to time, the OS shifts 
the processes in memory to place all the free memory together in one block. This is 
a time-consuming procedure, wasteful of processor time. 

Before we consider ways of dealing with the shortcomings of partitioning, we 
must clear up one loose end. Consider Figure 8.14; it should be obvious that a 
process is not likely to be loaded into the same place in main memory each time it is 
swapped in. Furthermore, if compaction is used, a process may be shifted while in 
main memory. A process in memory consists of instructions plus data. The instruc- 
tions will contain addresses for memory locations of two types: 


e Addresses of data items 


e Addresses of instructions, used for branching instructions 
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But these addresses are not fixed. They will change each time a process is 
swapped in. To solve this problem, a distinction is made between logical addresses 
and physical addresses. A logical address is expressed as a location relative to the 
beginning of the program. Instructions in the program contain only logical ad- 
dresses. A physical address is an actual location in main memory. When the 
processor executes a process, it automatically converts from logical to physical ad- 
dress by adding the current starting location of the process, called its base address, 
to each logical address. This is another example of a processor hardware feature 
designed to meet an OS requirement. The exact nature of this hardware feature 
depends on the memory management strategy in use. We will see several examples 
later in this chapter. 


Paging 


Both unequal fixed-size and variable-size partitions are inefficient in the use of 
memory. Suppose, however, that memory is partitioned into equal fixed-size chunks 
that are relatively small, and that each process is also divided into small fixed-size 
chunks of some size. Then the chunks of a program, known as pages, could be as- 
signed to available chunks of memory, known as frames, or page frames. At most, 
then, the wasted space in memory for that process is a fraction of the last page. 
Figure 8.15 shows an example of the use of pages and frames. At a given point 
in time, some of the frames in memory are in use and some are free. The list of free 
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Figure 8.15 Allocation of Free Frames 
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frames is maintained by the OS. Process A, stored on disk, consists of four pages. 
When it comes time to load this process, the OS finds four free frames and loads the 
four pages of the process A into the four frames. 

Now suppose, as in this example, that there are not sufficient unused con- 
tiguous frames to hold the process. Does this prevent the OS from loading A? The 
answer is no, because we can once again use the concept of logical address. A sim- 
ple base address will no longer suffice. Rather, the OS maintains a page table for 
each process. The page table shows the frame location for each page of the 
process. Within the program, each logical address consists of a page number and a 
relative address within the page. Recall that in the case of simple partitioning, a 
logical address is the location of a word relative to the beginning of the program; 
the processor translates that into a physical address. With paging, the logical-to- 
physical address translation is still done by processor hardware. The processor 
must know how to access the page table of the current process. Presented with a 
logical address (page number, relative address), the processor uses the page table 
to produce a physical address (frame number, relative address). An example is 
shown in Figure 8.16. 

This approach solves the problems raised earlier. Main memory is divided 
into many small equal-size frames. Each process is divided into frame-size pages: 
smaller processes require fewer pages, larger processes require more. When a 
process is brought in, its pages are loaded into available frames, and a page table 
is set up. 
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Figure 8.16 Logical and Physical Addresses 
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Virtual Memory 


DEMAND PAGING With the use of paging, truly effective multiprogramming sys- 
tems came into being. Furthermore, the simple tactic of breaking a process up into 
pages led to the development of another important concept: virtual memory. 

To understand virtual memory, we must add a refinement to the paging 
scheme just discussed. That refinement is demand paging, which simply means that 
each page of a process is brought in only when it is needed, that is, on demand. 

Consider a large process, consisting of a long program plus a number of ar- 
rays of data. Over any short period of time, execution may be confined to a small 
section of the program (e.g., a subroutine), and perhaps only one or two arrays of 
data are being used. This is the principle of locality, which we introduced in Appen- 
dix 4A. It would clearly be wasteful to load in dozens of pages for that process 
when only a few pages will be used before the program is suspended. We can make 
better use of memory by loading in just a few pages. Then, if the program branches 
to an instruction on a page not in main memory, or if the program references data 
on a page not in memory, a page fault is triggered. This tells the OS to bring in the 
desired page. 

Thus, at any one time, only a few pages of any given process are in memory, 
and therefore more processes can be maintained in memory. Furthermore, time is 
saved because unused pages are not swapped in and out of memory. However, the 
OS must be clever about how it manages this scheme. When it brings one page in, 
it must throw another page out; this is known as page replacement. If it throws out 
a page just before it is about to be used, then it will just have to go get that page 
again almost immediately. Too much of this leads to a condition known as 
thrashing: the processor spends most of its time swapping pages rather than exe- 
cuting instructions. The avoidance of thrashing was a major research area in the 
1970s and led to a variety of complex but effective algorithms. In essence, the OS 
tries to guess, based on recent history, which pages are least likely to be used in the 
near future. 


Page Replacement Algorithm Simulators 


A discussion of page replacement algorithms is beyond the scope of this chap- 
ter. A potentially effective technique is least recently used (LRU), the same algo- 
rithm discussed in Chapter 4 for cache replacement. In practice, LRU is difficult to 
implement for a virtual memory paging scheme. Several alternative approaches that 
seek to approximate the performance of LRU are in use; see Appendix F for details. 

With demand paging, it is not necessary to load an entire process into main 
memory. This fact has a remarkable consequence: It is possible for a process to be 
larger than all of main memory. One of the most fundamental restrictions in pro- 
gramming has been lifted. Without demand paging, a programmer must be acutely 
aware of how much memory is available. If the program being written is too large, 
the programmer must devise ways to structure the program into pieces that can be 
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loaded one at a time. With demand paging, that job is left to the OS and the hard- 
ware. As far as the programmer is concerned, he or she is dealing with a huge mem- 
ory, the size associated with disk storage. 

Because a process executes only in main memory, that memory is referred to 
as real memory. But a programmer or user perceives a much larger memory —that 
which is allocated on the disk. This latter is therefore referred to as virtual memory. 
Virtual memory allows for very effective multiprogramming and relieves the user of 
the unnecessarily tight constraints of main memory. 


PAGE TABLE STRUCTURE The basic mechanism for reading a word from memory 
involves the translation of a virtual, or logical, address, consisting of page number 
and offset, into a physical address, consisting of frame number and offset, using a 
page table. Because the page table is of variable length, depending on the size of the 
process, we cannot expect to hold it in registers. Instead, it must be in main memory 
to be accessed. Figure 8.16 suggests a hardware implementation of this scheme. 
When a particular process is running, a register holds the starting address of the 
page table for that process. The page number of a virtual address is used to index 
that table and look up the corresponding frame number. This is combined with the 
offset portion of the virtual address to produce the desired real address. 

In most systems, there is one page table per process. But each process can oc- 
cupy huge amounts of virtual memory. For example, in the VAX architecture, each 
process can have up to 27! = 2 GBytes of virtual memory. Using 2” = 512-byte 
pages, that means that as many as 2” page table entries are required per process. 
Clearly, the amount of memory devoted to page tables alone could be unaccept- 
ably high. To overcome this problem, most virtual memory schemes store page 
tables in virtual memory rather than real memory. This means that page tables are 
subject to paging just as other pages are. When a process is running, at least a part 
of its page table must be in main memory, including the page table entry of the 
currently executing page. Some processors make use of a two-level scheme to or- 
ganize large page tables. In this scheme, there is a page directory, in which each 
entry points to a page table. Thus, if the length of the page directory is X, and if the 
maximum length of a page table is Y, then a process can consist of up to X X Y 
pages. Typically, the maximum length of a page table is restricted to be equal to 
one page. We will see an example of this two-level approach when we consider the 
Pentium II later in this chapter. 

An alternative approach to the use of one- or two-level page tables is the use 
of an inverted page table structure (Figure 8.17). Variations on this approach are 
used on the PowerPC, UltraSPARC, and the IA-64 architecture. An implementation 
of the Mach OS on the RT-PC also uses this technique. 

In this approach, the page number portion of a virtual address is mapped 
into a hash value using a simple hashing function.” The hash value is a pointer to 


?A hash function maps numbers in the range 0 through M into numbers in the range 0 through N, where 
M > N.The output of the hash function is used as an index into the hash table. Since more than one input 
maps into the same output, it is possible for an input item to map to a hash table entry that is already oc- 
cupied. In that case, the new item must overflow into another hash table location. Typically, the new item is 
placed in the first succeeding empty space, and a pointer from the original location is provided to chain the 
entries together. See Appendix C for more information on hash functions. 
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Figure 8.17 Inverted Page Table Structure 


the inverted page table, which contains the page table entries. There is one entry in 
the inverted page table for each real memory page frame rather than one per vir- 
tual page. Thus a fixed proportion of real memory is required for the tables re- 
gardless of the number of processes or virtual pages supported. Because more 
than one virtual address may map into the same hash table entry, a chaining tech- 
nique is used for managing the overflow. The hashing technique results in chains 
that are typically short—between one and two entries. The page table’s structure 
is called inverted because it indexes page table entries by frame number rather 
than by virtual page number. 


Translation Lookaside Buffer 


In principle, then, every virtual memory reference can cause two physical memory ac- 
cesses: one to fetch the appropriate page table entry, and one to fetch the desired data. 
Thus, a straightforward virtual memory scheme would have the effect of doubling the 
memory access time. To overcome this problem, most virtual memory schemes make 
use of a special cache for page table entries, usually called a translation lookaside 
buffer (TLB). This cache functions in the same way as a memory cache and contains 
those page table entries that have been most recently used. Figure 8.18 is a flowchart 
that shows the use of the TLB. By the principle of locality, most virtual memory refer- 
ences will be to locations in recently used pages. Therefore, most references will in- 
volve page table entries in the cache. Studies of the VAX TLB have shown that this 
scheme can significantly improve performance [CLAR85, SATY81]. 
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Figure 8.18 Operation of Paging and Translation Lookaside Buffer (TLB) 


Note that the virtual memory mechanism must interact with the cache system 
(not the TLB cache, but the main memory cache). This is illustrated in Figure 8.19. 
A virtual address will generally be in the form of a page number, offset. First, the 
memory system consults the TLB to see if the matching page table entry is present. 
If it is, the real (physical) address is generated by combining the frame number with 
the offset. If not, the entry is accessed from a page table. Once the real address is 
generated, which is in the form of a tag and a remainder, the cache is consulted to 
see if the block containing that word is present (see Figure 4.5). If so, it is returned 


to the processor. If not, the word is retrieved from main memory. 


The reader should be able to appreciate the complexity of the processor hard- 
ware involved in a single memory reference. The virtual address is translated into a 
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Figure 8.19 Translation Lookaside Buffer and Cache Operation 


real address. This involves reference to a page table, which may be in the TLB, in 
main memory, or on disk. The referenced word may be in cache, in main memory, or 
on disk. In the latter case, the page containing the word must be loaded into main 
memory and its block loaded into the cache. In addition, the page table entry for 
that page must be updated. 


Segmentation 


There is another way in which addressable memory can be subdivided, known as 
segmentation. Whereas paging is invisible to the programmer and serves the pur- 
pose of providing the programmer with a larger address space, segmentation is usu- 
ally visible to the programmer and is provided as a convenience for organizing 
programs and data and as a means for associating privilege and protection attributes 
with instructions and data. 

Segmentation allows the programmer to view memory as consisting of mul- 
tiple address spaces or segments. Segments are of variable, indeed dynamic, size. 
Typically, the programmer or the OS will assign programs and data to different 
segments. There may be a number of program segments for various types of pro- 
grams as well as a number of data segments. Each segment may be assigned access 
and usage rights. Memory references consist of a (segment number, offset) form 
of address. 
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This organization has a number of advantages to the programmer over a non- 
segmented address space: 


1. It simplifies the handling of growing data structures. If the programmer does 
not know ahead of time how large a particular data structure will become, it is 
not necessary to guess. The data structure can be assigned its own segment, 
and the OS will expand or shrink the segment as needed. 


2. It allows programs to be altered and recompiled independently without re- 
quiring that an entire set of programs be relinked and reloaded. Again, this is 
accomplished using multiple segments. 


3. It lends itself to sharing among processes. A programmer can place a utility 
program or a useful table of data in a segment that can be addressed by other 
processes. 


4. It lends itself to protection. Because a segment can be constructed to contain a 
well-defined set of programs or data, the programmer or a system administra- 
tor can assign access privileges in a convenient fashion. 


These advantages are not available with paging, which is invisible to the pro- 
grammer. On the other hand, we have seen that paging provides for an efficient 
form of memory management. To combine the advantages of both, some systems 
are equipped with the hardware and OS software to provide both. 


8.4 PENTIUM MEMORY MANAGEMENT 


Since the introduction of the 32-bit architecture, microprocessors have evolved sophis- 
ticated memory management schemes that build on the lessons learned with medium- 
and large-scale systems. In many cases, the microprocessor versions are superior to 
their larger-system antecedents. Because the schemes were developed by the micro- 
processor hardware vendor and may be employed with a variety of operating systems, 
they tend to be quite general purpose. A representative example is the scheme used on 
the Pentium II. The Pentium II memory management hardware is essentially the same 
as that used in the Intel 80386 and 80486 processors, with some refinements. 


Address Spaces 


The Pentium II includes hardware for both segmentation and paging. Both mecha- 
nisms can be disabled, allowing the user to choose from four distinct views of memory: 


e Unsegmented unpaged memory: In this case, the virtual address is the same as 
the physical address. This is useful, for example, in low-complexity, high- 
performance controller applications. 


e Unsegmented paged memory: Here memory is viewed as a paged linear ad- 
dress space. Protection and management of memory is done via paging. This is 
favored by some operating systems (e.g., Berkeley UNIX). 

e Segmented unpaged memory: Here memory is viewed as a collection of logi- 
cal address spaces. The advantage of this view over a paged approach is that it 
affords protection down to the level of a single byte, if necessary. Furthermore, 
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unlike paging, it guarantees that the translation table needed (the segment 
table) is on-chip when the segment is in memory. Hence, segmented unpaged 
memory results in predictable access times. 


e Segmented paged memory: Segmentation is used to define logical memory 
partitions subject to access control, and paging is used to manage the alloca- 
tion of memory within the partitions. Operating systems such as UNIX System 
V favor this view. 


Segmentation 


When segmentation is used, each virtual address (called a logical address in the Pen- 
tium II documentation) consists of a 16-bit segment reference and a 32-bit offset. 
Two bits of the segment reference deal with the protection mechanism, leaving 
14 bits for specifying a particular segment. Thus, with unsegmented memory, the 
user’s virtual memory is 2° = 4 GBytes. With segmented memory, the total virtual 
memory space as seen by a user is 2*° = 64 terabytes (TBytes). The physical address 
space employs a 32-bit address for a maximum of 4 GBytes. 

The amount of virtual memory can actually be larger than the 64 TBytes. This 
is because the processor’s interpretation of a virtual address depends on which 
process is currently active. Virtual address space is divided into two parts. One-half 
of the virtual address space (8K segments X 4GBytes) is global, shared by all 
processes; the remainder is local and is distinct for each process. 

Associated with each segment are two forms of protection: privilege level and 
access attribute. There are four privilege levels, from most protected (level 0) to 
least protected (level 3). The privilege level associated with a data segment is its 
“classification”; the privilege level associated with a program segment is its “clear- 
ance.” An executing program may only access data segments for which its clearance 
level is lower than (more privileged) or equal to (same privilege) the privilege level 
of the data segment. 

The hardware does not dictate how these privilege levels are to be used; this 
depends on the OS design and implementation. It was intended that privilege level 
1 would be used for most of the OS, and level 0 would be used for that small portion 
of the OS devoted to memory management, protection, and access control. This 
leaves two levels for applications. In many systems, applications will reside at level 3, 
with level 2 being unused. Specialized application subsystems that must be pro- 
tected because they implement their own security mechanisms are good candidates 
for level 2. Some examples are database management systems, office automation 
systems, and software engineering environments. 

In addition to regulating access to data segments, the privilege mechanism limits 
the use of certain instructions. Some instructions, such as those dealing with memory- 
management registers, can only be executed in level 0. I/O instructions can only be ex- 
ecuted up to a certain level that is designated by the OS; typically, this will be level 1. 

The access attribute of a data segment specifies whether read/write or read- 
only accesses are permitted. For program segments, the access attribute specifies 
read/execute or read-only access. 

The address translation mechanism for segmentation involves mapping a 
virtual address into what is referred to as a linear address (Figure 8.20b). A virtual 
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Figure 8.20 Pentium Memory-Management Formats 
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address consists of the 32-bit offset and a 16-bit segment selector (Figure 8.20a). 
The segment selector consists of the following fields: 


e Table Indicator (TI): Indicates whether the global segment table or a local 
segment table should be used for translation. 


e Segment Number: The number of the segment. This serves as an index into the 
segment table. 


e Requested Privilege Level (RPL): The privilege level requested for this 
access. 


Each entry in a segment table consists of 64 bits, as shown in Figure 8.20c. The 
fields are defined in Table 8.5. 


Paging 


Segmentation is an optional feature and may be disabled. When segmentation is in 
use, addresses used in programs are virtual addresses and are converted into linear 
addresses, as just described. When segmentation is not in use, linear addresses are 
used in programs. In either case, the following step is to convert that linear address 
into a real 32-bit address. 

To understand the structure of the linear address, you need to know that the 
Pentium II paging mechanism is actually a two-level table lookup operation. The first 
level is a page directory, which contains up to 1024 entries. This splits the 4-GByte lin- 
ear memory space into 1024 page groups, each with its own page table, and each 
4 MBytes in length. Each page table contains up to 1024 entries; each entry corre- 
sponds to a single 4-KByte page. Memory management has the option of using one 
page directory for all processes, one page directory for each process, or some combi- 
nation of the two. The page directory for the current task is always in main memory. 
Page tables may be in virtual memory. 

Figure 8.20 shows the formats of entries in page directories and page tables, 
and the fields are defined in Table 8.5. Note that access control mechanisms can be 
provided on a page or page group basis. 

The Pentium II also makes use of a translation lookaside buffer. The buffer 
can hold 32 page table entries. Each time that the page directory is changed, the 
buffer is cleared. 

Figure 8.21 illustrates the combination of segmentation and paging mecha- 
nisms. For clarity, the translation lookaside buffer and memory cache mechanisms 
are not shown. 

Finally, the Pentium II includes a new extension not found on the 80386 or 
80486, the provision for two page sizes. If the PSE (page size extension) bit in con- 
trol register 4 is set to 1, then the paging unit permits the OS programmer to define 
a page as either 4 KByte or 4 MByte in size. 

When 4-MByte pages are used, there is only one level of table lookup for 
pages. When the hardware accesses the page directory, the page directory entry 
(Figure 8.20d) has the PS bit set to 1. In this case, bits 9 through 21 are ignored and 
bits 22 through 31 define the base address for a 4-MByte page in memory. Thus, 
there is a single page table. 


Segment Descriptor (Segment Table Entry) 


Base 
Defines the starting address of the segment within the 4-GByte linear address space. 


D/B bit 
In a code segment, this is the D bit and indicates whether operands and addressing modes are 16 or 32 bits. 


Descriptor Privilege Level (DPL) 
Specifies the privilege level of the segment referred to by this segment descriptor. 


Granularity bit (G) 

Indicates whether the Limit field is to be interpreted in units by one byte or 4 KBytes. 
Limit 

Defines the size of the segment. The processor interprets the limit field in one of two ways, depending on 
the granularity bit: in units of one byte, up to a segment size limit of 1 MByte, or in units of 4 KBytes, up to a 
segment size limit of 4 GBytes. 
S bit 

Determines whether a given segment is a system segment or a code or data segment. 


Segment Present bit (P) 
Used for nonpaged systems. It indicates whether the segment is present in main memory. For paged 
systems, this bit is always set to 1. 


Type 
Distinguishes between various kinds of segments and indicates the access attributes. 


Page Directory Entry and Page Table Entry 


Accessed bit (A) 
This bit is set to 1 by the processor in both levels of page tables when a read or write operation to the 
corresponding page occurs. 


Dirty bit (D) 
This bit is set to 1 by the processor when a write operation to the corresponding page occurs. 


Page Frame Address 

Provides the physical address of the page in memory if the present bit is set. Since page frames are aligned 
on 4K boundaries, the bottom 12 bits are 0, and only the top 20 bits are included in the entry. In a page direc- 
tory, the address is that of a page table. 


Page Cache Disable bit (PCD) 
Indicates whether data from page may be cached. 


Page Size bit (PS) 
Indicates whether page size is 4 KByte or 4 MByte. 


Page Write Through bit (PWT) 
Indicates whether write-through or write-back caching policy will be used for data in the corresponding page. 


Present bit (P) 
Indicates whether the page table or page is in main memory. 


Read/Write bit (RW) 
For user-level pages, indicates whether the page is read-only access or read/write access for user-level programs. 


User/Supervisor bit (US) 
Indicates whether the page is available only to the operating system (supervisor level) or is available to 
both operating system and applications (user level). 


Logical address 


Figure 8.21 Pentium Memory Address Translation Mechanisms 
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The use of 4-MByte pages reduces the memory-management storage require- 
ments for large main memories. With 4-KByte pages, a full 4-GByte main memory 
requires about 4 MBytes of memory just for the page tables. With 4-MByte pages, a 
single table, 4 KBytes in length, is sufficient for page memory management. 


8.5 ARM MEMORY MANAGEMENT 


ARM provides a versatile virtual memory system architecture that can be tailored 
to the needs of the embedded system designer. 


Memory System Organization 


Figure 8.22 provides an overview of the memory management hardware in the 
ARM for virtual memory. The virtual memory translation hardware uses one or two 
levels of tables for translation from virtual to physical addresses, as explained subse- 
quently. The translation lookaside buffer (TLB) is a cache of recent page table en- 
tries. If an entry is available in the TLB, then the TLB directly sends a physical 
address to main memory for a read or write operation. As explained in Chapter 4, 
data is exchanged between the processor and main memory via the cache. If a logi- 
cal cache organization is used (Figure 4.7a), then the ARM supplies that address di- 
rectly to the cache as well as supplying it to the TLB when a cache miss occurs. If a 
physical cache organization is used (Figure 4.7b), then the TLB must supply the 


physical address to the cache. 
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Figure 8.22 ARM Memory System Overview 


Entries in the translation tables also include access control bits, which deter- 
mine whether a given process may access a given portion of memory. If access is de- 
nied, access control hardware supplies an abort signal to the ARM processor. 


Virtual Memory Address Translation 
The ARM supports memory access based on either sections or pages: 


e Supersections (optional): Consist of 16-MB blocks of main memory 
e Sections: Consist of 1-MB blocks of main memory 
e Large pages: Consist of 64-KB blocks of main memory 


e Small pages: Consist of 4-KB blocks of main memory 


Sections and supersections are supported to allow mapping of a large re- 
gion of memory while using only a single entry in the TLB. Additional access con- 
trol mechanisms are extended within small pages to 1KB subpages, and within 


large pages to 16KB subpages. The translation table held in main memory has 
two levels: 


e First-level table: Holds section and supersection translations, and pointers to 
second-level tables 


e Second-level tables: Hold both large and small page translations 


The memory-management unit (MMU) translates virtual addresses generated 
by the processor into physical addresses to access main memory, and also derives 
and checks the access permission. Translations occur as the result of a TLB miss, and 
start with a first-level fetch. A section-mapped access only requires a first-level 
fetch, whereas a page-mapped access also requires a second-level fetch. 

Figure 8.23 shows the two-level address translation process for small pages. 
There is a single level 1 (L1) page table with 4K 32-bit entries. Each L1 entry points 
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Figure 8.23 ARM Virtual Memory Address Translation for Small Pages 


to a level 2 (L2) page table with 255 32-bit entries. Each of the L2 entry points to a 
4-KB page in main memory. The 32-bit virtual address is interpreted as follows: The 
most significant 12 bits are an index into the L1 page table. The next 8 bits are an 
index into the relevant L2 page table. The least significant 12 bits index a byte in the 
relevant page in main memory. 

A similar two-page lookup procedure is used for large pages. For sections and 
supersection, only the L1 page table lookup is required. 


Memory-Management Formats 


To get a better understanding of the ARM memory management scheme, we con- 
sider the key formats, as shown in Figure 8.24. The control bits shown in this figure 
are defined in Table 8.6. 

For the L1 table, each entry is a descriptor of how its associated 1-MB virtual 
address range is mapped. Each entry has one of four alternative formats: 


e Bits [1:0] = 00: The associated virtual addresses are unmapped, and attempts 
to access them generate a translation fault. 


e Bits [1:0] = 01: The entry gives the physical address of an L2 page table, 
which specifies how the associated virtual address range is mapped. 
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31 24 23 20 19 14 1211109 8 5 4 3 2 1 0 
IGN 0 0 | 
Coarse page table base address P| Domain SBZ |0 1 | 
S) tn}. |AP x 
Section base address B/ 0 S TEX AP | P| Domain C/B]1 0 
Z G xX N 
: S 
Supersection Base address n IAP Base address| X 
base address [35:32] e : G . x ee Ape [39:36] N al a 
(a) Alternative first-level descriptor formats 
31 16 15 14 1211109 8765 43 210 
IGN 0 0 
n |API x 
Small page base address G Ny x TEX | AP |C/B/} 1 N 
Large page base address = TEX |"Is baal SBZ AP |C}B/0O 1 
N G x 
(b) Alternative second-level descriptor formats 
31 24 23 20 19 0 
Level 1 table index — Supersection index 
31 20 19 0 
Level | table index Section index | 
31 20 19 12 11 0 
Level | table index Level 2 table index Page index | 
31 20 19 16 15 12 11 0 
F Level 2 : 
Level | table index Pablodniee Page index 


(c) Virtual memory address formats 


ARMv6 Memory-Management Formats 


e Bits [1:0] = 01 and bit 19 = 0: The entry is a section descriptor for its associ- 


ated virtual addresses. 


e Bits [1:0] = 01 and bit 19 = 1: The entry is a supersection descriptor for its 
associated virtual addresses. 


Entries with bits [1:0] = 11 are reserved. 
For memory structured into pages, a two-level page table access is required. 
Bits [31:10] of the L1 page entry contain a pointer to a L1 page table. For small 
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Table 8.6 ARM Memory-Management Parameters 


Access Permission (AP), Access Permission Extension (APX) 
These bits control access to the corresponding memory region. If an access is made to an area of memory 
without the required permissions, a Permission Fault is raised. 


Bufferable (B) bit 
Determines, with the TEX bits, how the write buffer is used for cacheable memory. 


Cacheable (C) bit 
Determines whether this memory region can be mapped through the cache. 


Domain 
Collection of memory regions. Access control can be applied on the basis of domain. 


not Global (nG) 


Determines whether the translation should be marked as global (0), or process specific (1). 


Shared (S) 
Determines whether the translation is for not-shared (0), or shared (1) memory. 


SBZ 
Should be zero. 


Type Extension (TEX) 
These bits, together with the B and C bits, control accesses to the caches, how the write buffer is used, and 
if the memory region is shareable and therefore must be kept coherent. 


Execute Never (XN) 
Determines whether the region is executable (0) or not executable (1). 


pages, the L2 entry contains a 20-bit pointer to the base address of a 4-KB page in 
main memory. 

For large pages, the structure is more complex. As with virtual addresses for 
small pages, a virtual address for a large page structure includes a 12-bit index into 
the level one table and an 8-bit index into the L2 table. For the 64-KB large pages, the 
page index portion of the virtual address must be 16 bits. To accommodate all of 
these bits in a 32-bit format, there is a 4-bit overlap between the page index field and 
the L2 table index field. ARM accommodates this overlap by requiring that each 
page table entry in a L2 page table that supports large pages be replicated 16 times. 
In effect, the size of the L2 page table is reduced from 256 entries to 16 entries, if all 
of the entries refer to large pages. However, a given L2 page can service a mixture of 
large and small pages, hence the need for the replication for large page entries. 

For memory structured into sections or supersections, a one-level page table 
access is required. For sections, bits [31:20] of the L1 entry contain a 12-bit pointer to 
the base of the 1-MB section in main memory. 

For supersections, bits [31:24] of the L1 entry contain an 8-bit pointer to the 
base of the 16-MB section in main memory. As with large pages, a page table entry 
replication is required. In the case of supersections, the L1 table index portion of the 
virtual address overlaps by 4 bits with the supersection index portion of the virtual 
address Therefore, 16 identical L1 page table entries are required. 

The range of physical address space can be expanded by up to eight additional 
address bits (bits [23:20] and [8:5]). The number of additional bits is implementation 
dependent. These additional bits can be interpreted as extending the size of physical 


298 CHAPTER 8 / OPERATING SYSTEM SUPPORT 


memory by as much as a factor of 2° = 256. Thus, physical memory may in fact be as 
much as 256 times as large as the memory space available to each individual process. 


Access Control 


The AP access control bits in each table entry control access to a region of memory 
by a given process. A region of memory can be designated as no access, read only, or 
read-write. Further, the region can be designated as privileged access only, reserved 
for use by the OS and not by applications. 

ARM also employs the concept of a domain, which is a collection of sections 
and/or pages that have particular access permissions. The ARM architecture sup- 
ports 16 domains. The domain feature allows multiple processes to use the same 
translation tables while maintaining some protection from each other. 

Each page table entry and TLB entry contains a field that specifies which do- 
main the entry is in. A 2-bit field in the Domain Access Control Register controls 
access to each domain. Each field allows the access to an entire domain to be en- 
abled and disabled very quickly, so that whole memory areas can be swapped in and 
out of virtual memory very efficiently. Two kinds of domain access are supported: 


e Clients: Users of domains (execute programs and access data) that must ob- 
serve the access permissions of the individual sections and/or pages that make 
up that domain 


e Managers: Control the behavior of the domain (the current sections and pages 
in the domain, and the domain access), and bypass the access permissions for 
table entries in that domain 


One program can be a client of some domains, and a manager of some other 
domains, and have no access to the remaining domains. This allows very flexible 
memory protection for programs that access different memory resources. 


8.6 RECOMMENDED READING AND WEB SITES 


[STALO09] covers the topics of this chapter in detail. 


Recommended Web sites: 


e Operating System Resource Center: A useful collection of documents and papers on a 
wide range of OS topics 

e ACM Special Interest Group on Operating Systems: Information on SIGOPS publica- 
tions and conferences 

e IEEE Technical Committee on Operating Systems and Applications: Includes an on- 
line newsletter and links to other sites 
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8.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


batch system multitasking resident monitor 
demand paging nucleus segmentation 
interactive operating system operating system (OS) short-term scheduling 
interrupt paging swapping 

job control language (JCL) page table thrashing 

kernel partitioning time-sharing system 


logical address physical address translation lookaside buffer 
long-term scheduling privileged instruction (TLB) 

medium-term scheduling process utility 

memory management process control block virtual memory 

memory protection process state 

multiprogramming real memory 


Review Questions 


8.1 What is an operating system? 

8.2 List and briefly define the key services provided by an OS. 

8.3 List and briefly define the major types of OS scheduling. 

8.4 What is the difference between a process and a program? 

8.5 What is the purpose of swapping? 

8.6 Ifa process may be dynamically assigned to different locations in main memory, what 
is the implication for the addressing mechanism? 

8.7 Isit necessary for all of the pages of a process to be in main memory while the process 
is executing? 

8.8 Must the pages of a process in main memory be contiguous? 

8.9 Is it necessary for the pages of a process in main memory to be in sequential order? 

8.10 What is the purpose of a translation lookaside buffer? 


Problems 


8.1 Suppose that we have a multiprogrammed computer in which each job has identical 
characteristics. In one computation period, T, for a job, half the time is spent in I/O 
and the other half in processor activity. Each job runs for a total of N periods. Assume 
that a simple round-robin priority is used, and that I/O operations can overlap with 
processor operation. Define the following quantities: 

e Turnaround time = actual time to complete a job 

e Throughput = average number of jobs completed per time period T 

e Processor utilization = percentage of time that the processor is active (not waiting) 
Compute these quantities for one, two, and four simultaneous jobs, assuming that the 
period T is distributed in each of the following ways: 

a. I/O first half, processor second half 

b. I/O first and fourth quarters, processor second and third quarters 

8.2 An I/O-bound program is one that, if run alone, would spend more time waiting for 
I/O than using the processor. A processor-bound program is the opposite. Suppose a 
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8.4 


8.7 


8.8 


8.9 


short-term scheduling algorithm favors those programs that have used little processor 
time in the recent past. Explain why this algorithm favors I/O-bound programs and 
yet does not permanently deny processor time to processor-bound programs. 


A program computes the row sums 


Ci = 4; 
j=l 

of an array A that is 100 by 100. Assume that the computer uses demand paging with 
a page size of 1000 words, and that the amount of main memory allotted for data is 
five page frames. Is there any difference in the page fault rate if A were stored in vir- 
tual memory by rows or columns? Explain. 
Consider a fixed partitioning scheme with equal-size partitions of 2'* bytes and a total 
main memory size of 2™ bytes. A process table is maintained that includes a pointer to 
a partition for each resident process. How many bits are required for the pointer? 
Consider a dynamic partitioning scheme. Show that, on average, the memory contains 
half as many holes as segments. 
Suppose the page table for the process currently executing on the processor looks like 
the following. All numbers are decimal, everything is numbered starting from zero, 
and all addresses are memory byte addresses. The page size is 1024 bytes. 


Virtual page Page frame 
number Valid bit Reference bit Modify bit number 

0 1 1 0 

1 1 1 1 

2 0 0 0 — 
3 1 0 0 2 
4 0 0 0 _ 
5 1 0 1 0 


a. Describe exactly how, in general, a virtual address generated by the CPU is trans- 
lated into a physical main memory address. 
b. What physical address, if any, would each of the following virtual addresses corre- 
spond to? (Do not try to handle any page faults, if any.) 
@ 1052 
(ii) 2221 
(iii) 5499 
Give reasons that the page size in a virtual memory system should be neither very 
small nor very large. 
A process references five pages, A, B, C, D, and E, in the following order: 
A; B;C; D; A; B; E; A; B; C; D; E 
Assume that the replacement algorithm is first-in-first-out and find the number of 
page transfers during this sequence of references starting with an empty main mem- 
ory with three page frames. Repeat for four page frames. 
The following sequence of virtual page numbers is encountered in the course of exe- 
cution on a computer with virtual memory: 


342647132635123 


Assume that a least recently used page replacement policy is adopted. Plot a graph of 
page hit ratio (fraction of page references in which the page is in main memory) as a 
function of main-memory page capacity n for 1 = n = 8. Assume that main memory 
is initially empty. 


8.10 


8.11 


8.12 


8.14 


8.16 


8.17 
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In the VAX computer, user page tables are located at virtual addresses in the system 
space. What is the advantage of having user page tables in virtual rather than main 
memory? What is the disadvantage? 


Suppose the program statement 


for (i = 1;1<=n;i++) 
ali] = b[i] + c[iJ; 


is executed in a memory with page size of 1000 words. Let n = 1000. Using a machine 
that has a full range of register-to-register instructions and employs index registers, 
write a hypothetical program to implement the foregoing statement. Then show the 
sequence of page references during execution. 


The IBM System/370 architecture uses a two-level memory structure and refers to the 
two levels as segments and pages, although the segmentation approach lacks many of 
the features described earlier in this chapter. For the basic 370 architecture, the page 
size may be either 2 Kbytes or 4 Kbytes, and the segment size is fixed at either 
64 Kbytes or 1 MByte. For the 370/XA and 370/ESA architectures, the page size is 
4 Kbytes and the segment size is 1 MByte. Which advantages of segmentation does 
this scheme lack? What is the benefit of segmentation for the 370? 


Consider a computer system with both segmentation and paging. When a segment is 
in memory, some words are wasted on the last page. In addition, for a segment size s 
and a page size p, there are s/p page table entries. The smaller the page size, the less 
waste in the last page of the segment, but the larger the page table. What page size 
minimizes the total overhead? 


A computer has a cache, main memory, and a disk used for virtual memory. If a refer- 
enced word is in the cache, 20 ns are required to access it. If it is in main memory but 
not in the cache, 60 ns are needed to load it into the cache, and then the reference is 
started again. If the word is not in main memory, 12 ms are required to fetch the word 
from disk, followed by 60 ns to copy it to the cache, and then the reference is started 
again. The cache hit ratio is 0.9 and the main-memory hit ratio is 0.6. What is the av- 
erage time in ns required to access a referenced word on this system? 


Assume a task is divided into four equal-sized segments and that the system builds an 

eight-entry page descriptor table for each segment. Thus, the system has a combina- 

tion of segmentation and paging. Assume also that the page size is 2 KBytes. 

a. What is the maximum size of each segment? 

b. What is the maximum logical address space for the task? 

c. Assume that an element in physical location 00021ABC is accessed by this task. 
What is the format of the logical address that the task generates for it? What is the 
maximum physical address space for the system? 


Assume a microprocessor capable of accessing up to 2” bytes of physical main mem- 
ory. It implements one segmented logical address space of maximum size 2*! bytes. 
Each instruction contains the whole two-part address. External memory management 
units (MMUs) are used, whose management scheme assigns contiguous blocks of 
physical memory of fixed size 2” bytes to segments. The starting physical address of a 
segment is always divisible by 1024. Show the detailed interconnection of the external 
mapping mechanism that converts logical addresses to physical addresses using the 
appropriate number of MMUs, and show the detailed internal structure of an MMU 
(assuming that each MMU contains a 128-entry directly mapped segment descriptor 
cache) and how each MMU is selected. 


Consider a paged logical address space (composed of 32 pages of 2 KBytes each) 

mapped into a 1-MByte physical memory space. 

a. What is the format of the processor’s logical address? 

b. What is the length and width of the page table (disregarding the “access rights” 
bits)? 

c. What is the effect on the page table if the physical memory space is reduced by 
half? 
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8.18 


8.19 


8.20 


In IBM’s mainframe operating system, OS/390, one of the major modules in the ker- 
nel is the System Resource Manager (SRM). This module is responsible for the allo- 
cation of resources among address spaces (processes). The SRM gives OS/390 a 
degree of sophistication unique among operating systems. No other mainframe OS, 
and certainly no other type of OS, can match the functions performed by SRM. The 
concept of resource includes processor, real memory, and I/O channels. SRM accu- 
mulates statistics pertaining to utilization of processor, channel, and various key data 
structures. Its purpose is to provide optimum performance based on performance 
monitoring and analysis. The installation sets forth various performance objectives, 
and these serve as guidance to the SRM, which dynamically modifies installation and 
job performance characteristics based on system utilization. In turn, the SRM pro- 
vides reports that enable the trained operator to refine the configuration and para- 
meter settings to improve user service. 

This problem concerns one example of SRM activity. Real memory is divided 
into equal-sized blocks called frames, of which there may be many thousands. Each 
frame can hold a block of virtual memory referred to as a page. SRM receives control 
approximately 20 times per second and inspects each and every page frame. If the 
page has not been referenced or changed, a counter is incremented by 1. Over time, 
SRM averages these numbers to determine the average number of seconds that a 
page frame in the system goes untouched. What might be the purpose of this and what 
action might SRM take? 


For each of the ARM virtual address formats shown in Figure 8.24, show the physical 
address format. 

Draw a figure similar to Figure 8.23 for ARM virtual memory translation when main 
memory is divided into sections. 
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PART THREE 


The Central 
Processing Unit 


P.1 ISSUES FOR PART THREE 


Up to this point, we have viewed the processor essentially as a “black box” and have 
considered its interaction with I/O and memory. Part Three examines the internal 
structure and function of the processor. The processor consists of registers, the arith- 
metic and logic unit, the instruction execution unit, a control unit, and the intercon- 
nections among these components. Architectural issues, such as instruction set 
design and data types, are covered. The part also looks at organizational issues, such 
as pipelining. 


ROAD MAP FOR PART THREE 


Chapter 9 Computer Arithmetic 


Chapter 9 examines the functionality of the arithmetic and logic unit 
(ALU) and focuses on the representation of numbers and techniques for 
implementing arithmetic operations. Processors typically support two 
types of arithmetic: integer, or fixed point, and floating point. For both 
cases, the chapter first examines the representation of numbers and then 
discusses arithmetic operations. The important IEEE 754 floating-point 
standard is examined in detail. 


Chapter 10 Instruction Sets: Characteristics and Functions 


From a programmet’s point of view, the best way to understand the oper- 
ation of a processor is to learn the machine instruction set that it executes. 
The complex topic of instruction set design occupies Chapters 10 and 11. 
Chapter 10 focuses on the functional aspects of instruction set design. The 
chapter examines the types of functions that are specified by computer in- 
structions and then looks specifically at the types of operands (which spec- 
ify the data to be operated on) and the types of operations (which specify 
the operations to be performed) commonly found in instruction sets. Then 
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the relationship of processor instructions to assembly language is briefly 
explained. 


Chapter 11 Instruction Sets: Addressing Modes 
and Formats 


Whereas Chapter 10 can be viewed as dealing with the semantics of in- 
struction sets, Chapter 11 is more concerned with the syntax of instruction 
sets. Specifically, Chapter 11 looks at the way in which memory addresses 
are specified and at the overall format of computer instructions. 


Chapter 12 Processor Structure and Function 


Chapter 12 is devoted to a discussion of the internal structure and func- 
tion of the processor. The chapter describes the use of registers as the 
CPU’s internal memory and then pulls together all of the material cov- 
ered so far to provide an overview of CPU structure and function. The 
overall organization (ALU, register file, control unit) is reviewed. Then 
the organization of the register file is discussed. The remainder of the 
chapter describes the functioning of the processor in executing machine 
instructions. The instruction cycle is examined to show the function and 
interrelationship of fetch, indirect, execute, and interrupt cycles. Finally, 
the use of pipelining to improve performance is explored in depth. 


Chapter 13 Reduced Instruction Set Computers 


The remainder of Part Three looks in more detail at the key trends in 
CPU design. Chapter 13 describes the approach associated with the con- 
cept of a reduced instruction set computer (RISC), which is one of the 
most significant innovations in computer organization and architecture in 
recent years. RISC architecture is a dramatic departure from the histori- 
cal trend in processor architecture. An analysis of this approach brings 
into focus many of the important issues in computer organization and ar- 
chitecture. The chapter examines the motivation for the use of RISC de- 
sign and then looks at the details of RISC instruction set design and RISC 
CPU architecture and compares RISC with the complex instruction set 
computer (CISC) approach. 


Chapter 14 Instruction-Level Parallelism and 
Superscalar Processors 


Chapter 14 examines an even more recent and equally important design in- 
novation: the superscalar processor. Although superscalar technology can 
be used on any processor, it is especially well suited to a RISC architecture. 
The chapter also looks at the general issue of instruction-level parallelism. 
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KEY POINTS 


® The two principal concerns for computer arithmetic are the way in which 
numbers are represented (the binary format) and the algorithms used for 
the basic arithmetic operations (add, subtract, multiply, divide). These two 
considerations apply both to integer and floating-point arithmetic. 


® Floating-point numbers are expressed as a number (significand) multiplied 
by a constant (base) raised to some integer power (exponent). Floating- 
point numbers can be used to represent very large and very small numbers. 

® Most processors implement the IEEE 754 standard for floating-point rep- 
resentation and floating-point arithmetic. IEEE 754 defines both a 32-bit 
and a 64-bit format. 


We begin our examination of the processor with an overview of the arithmetic and 
logic unit (ALU). The chapter then focuses on the most complex aspect of the ALU, 
computer arithmetic. The logic functions that are part of the ALU are described in 
Chapter 10, and implementations of simple logic and arithmetic functions in digital 
logic are described in Chapter 20. 

Computer arithmetic is commonly performed on two very different types of 
numbers: integer and floating point. In both cases, the representation chosen is a crucial 
design issue and is treated first, followed by a discussion of arithmetic operations. 

This chapter includes a number of examples, each of which is highlighted in a 
shaded box. 


9.1 THE ARITHMETIC AND LOGIC UNIT 


The ALU is that part of the computer that actually performs arithmetic and logical 
operations on data. All of the other elements of the computer system—control unit, 
registers, memory, I/O—are there mainly to bring data into the ALU for it to 
process and then to take the results back out. We have, in a sense, reached the core 
or essence of a computer when we consider the ALU. 

An ALU and, indeed, all electronic components in the computer are based on 
the use of simple digital logic devices that can store binary digits and perform simple 
Boolean logic operations. For the interested reader, Chapter 20 explores digital 
logic implementation. 

Figure 9.1 indicates, in general terms, how the ALU is interconnected with the 
rest of the processor. Data are presented to the ALU in registers, and the results of 
an operation are stored in registers. These registers are temporary storage locations 
within the processor that are connected by signal paths to the ALU (e.g., see 
Figure 2.3). The ALU may also set flags as the result of an operation. For example, 
an overflow flag is set to 1 if the result of a computation exceeds the length of the 
register into which it is to be stored. The flag values are also stored in registers 


Control 
unit 


Registers —————> 
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ALU 


Figure 9.1 ALU Inputs and Outputs 


Flags 


Registers 


within the processor. The control unit provides signals that control the operation of 


the ALU and the movement of the data into and out of the ALU. 


9.2 INTEGER REPRESENTATION 


In the binary number system,! arbitrary numbers can be represented with just the 


digits zero and one, the minus sign, and the period, or radix point. 


~1101.0101, = —13.3125,, 


For purposes of computer storage and processing, however, we do not have the 
benefit of minus signs and periods. Only binary digits (0 and 1) may be used to rep- 
resent numbers. If we are limited to nonnegative integers, the representation is 


straightforward. 


An 8-bit word can represent the numbers from 0 to 255, including 


00000000 = 0 
00000001 = 1 
00101001 = 41 
10000000 = 128 
11111111 = 255 


In general, if an n-bit sequence of binary digits a,_,a,_»...a a is interpreted 
as an unsigned integer A, its value is 


n-1 
A= >2'a, 
i=0 


See Chapter 19 for a basic refresher on number systems (decimal, binary, hexadecimal). 
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Sign-Magnitude Representation 


There are several alternative conventions used to represent negative as well as pos- 
itive integers, all of which involve treating the most significant (leftmost) bit in the 
word as a sign bit. If the sign bit is 0, the number is positive; if the sign bit is 1, the 
number is negative. 

The simplest form of representation that employs a sign bit is the sign-magnitude 
representation. In an n-bit word, the rightmost n — 1 bits hold the magnitude of 
the integer. 


+18 = 00010010 
— 18  =10010010 (sign magnitude) 


The general case can be expressed as follows: 


n-2 ; 
> 2'q; if a,_, = 0 
Ais Hh (9.1) 
—Si2'a, ifa,.=1 
i=0 


Sign Magnitude 


There are several drawbacks to sign-magnitude representation. One is that addi- 
tion and subtraction require a consideration of both the signs of the numbers and their 
relative magnitudes to carry out the required operation. This should become clear in the 
discussion in Section 9.3. Another drawback is that there are two representations of 0: 


+019 = 00000000 
—0;9 | = 10000000 (sign magnitude) 


This is inconvenient because it is slightly more difficult to test for 0 (an operation 
performed frequently on computers) than if there were a single representation. 

Because of these drawbacks, sign-magnitude representation is rarely used in 
implementing the integer portion of the ALU. Instead, the most common scheme is 
twos complement representation.” 


Twos Complement Representation 


Like sign magnitude, twos complement representation uses the most significant bit 
as a sign bit, making it easy to test whether an integer is positive or negative. It dif- 
fers from the use of the sign-magnitude representation in the way that the other bits 
are interpreted. Table 9.1 highlights key characteristics of twos complement repre- 
sentation and arithmetic, which are elaborated in this section and the next. 

Most treatments of twos complement representation focus on the rules for 
producing negative numbers, with no formal proof that the scheme “works.” Instead, 


"In the literature, the terms two’s complement or 2’s complement are often used. Here we follow the prac- 
tice used in standards documents and omit the apostrophe (e.g., IEEE Std 100-1992, The New IEEE 
Standard Dictionary of Electrical and Electronics Terms). 
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Table 9.1 Characteristics of Twos Complement Representation and Arithmetic 


Range —2""| through 2""! — 1 


Number of Representations 


of Zero eae 


Take the Boolean complement of each bit of the corresponding 
Negation positive number, then add 1 to the resulting bit pattern viewed 
as an unsigned integer. 


Add additional bit positions to the left and fill in with the value 


Bape ce Biome of the original sign bit. 


If two numbers with the same sign (both positive or both 
negative) are added, then overflow occurs if and only if the 
result has the opposite sign. 


Overflow Rule 


To subtract B from A, take the twos complement of B and add 


Subtraction Rule ee 


our presentation of twos complement integers in this section and in Section 9.3 is 
based on [DATT93], which suggests that twos complement representation is best 
understood by defining it in terms of a weighted sum of bits, as we did previously for 
unsigned and sign-magnitude representations. The advantage of this treatment is 
that it does not leave any lingering doubt that the rules for arithmetic operations in 
twos complement notation may not work for some special cases. 

Consider an n-bit integer, A, in twos complement representation. If A is posi- 
tive, then the sign bit, a,_;, is zero. The remaining bits represent the magnitude of 
the number in the same fashion as for sign magnitude: 


n-2 
A= >2'a,  forA=0 
=0 


The number zero is identified as positive and therefore has a 0 sign bit and a magni- 
tude of all Os. We can see that the range of positive integers that may be represented 
is from 0 (all of the magnitude bits are 0) through 2”! — 1 (all of the magnitude bits 
are 1). Any larger number would require more bits. 

Now, for a negative number A (A < 0), the sign bit, a,,_,, is one. The remaining 
n — 1 bits can take on any one of 2”7! values. Therefore, the range of negative inte- 
gers that can be represented is from —1 to —2”~!. We would like to assign the bit val- 
ues to negative integers in such a way that arithmetic can be handled in a 
straightforward fashion, similar to unsigned integer arithmetic. In unsigned integer 
representation, to compute the value of an integer from the bit representation, the 
weight of the most significant bit is +2”~!. For a representation with a sign bit, it turns 
out that the desired arithmetic properties are achieved, as we will see in Section 9.3, 
if the weight of the most significant bit is -2”~!. This is the convention used in twos 
complement representation, yielding the following expression for negative numbers: 

n-2 


Twos Complement =A = —2”"'a,_, + >)2'q; (9.2) 
i=0 


Equation (9.2) defines the twos complement representation for both positive and 
negative numbers. For a,_, = 0, the term —2”~'a,_; = 0 and the equation defines a 
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Table 9.2 Alternative Representations for 4-Bit Integers 


Decimal Sign-Magnitude Twos Complement Biased 
Representation Representation Representation Representation 


1111 


nonnegative integer. When a,_, = 1, the term 2”"! is subtracted from the summa- 
tion term, yielding a negative integer. 

Table 9.2 compares the sign-magnitude and twos complement representations 
for 4-bit integers. Although twos complement is an awkward representation from 
the human point of view, we will see that it facilitates the most important arithmetic 
operations, addition and subtraction. For this reason, it is almost universally used as 
the processor representation for integers. 

A useful illustration of the nature of twos complement representation is a 
value box, in which the value on the far right in the box is 1 (2°) and each succeeding 
position to the left is double in value, until the leftmost position, which is negated. 
As you can see in Figure 9.2a, the most negative twos complement number that can 
be represented is —2”~'; if any of the bits other than the sign bit is one, it adds a pos- 
itive amount to the number. Also, it is clear that a negative number must have a 1 at 
its leftmost position and a positive number must have a 0 in that position. Thus, the 
largest positive number is a 0 followed by all 1s, which equals 2”~! — 1. 

The rest of Figure 9.2 illustrates the use of the value box to convert from twos 
complement to decimal and from decimal to twos complement. 


Converting between Different Bit Lengths 


It is sometimes desirable to take an n-bit integer and store it in m bits, where m > n. 
In sign-magnitude notation, this is easily accomplished: simply move the sign bit to 
the new leftmost position and fill in with zeros. 


+18 
+18 
—18 
—18 
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11283 || eb || 32) 16 8 4 2; 1 


(a) An eight-position twos complement value box 


—128 | 64 | 32 16 8 4 2 1 

1 0 0 0 0 0 1 1 

—128 +2 +1 =—125 
(b) Convert binary 10000011 to decimal 


11835 || (ob || ap? 16 8 4 D; 1 
1 0 0) 0 1 0 0 0 


—120 = —128 +8 
(c) Convert decimal — 120 to binary 


Figure 9.2 Use of a Value Box for Conversion between 
Twos Complement Binary and Decimal 


= 00010010 ( 

= _0000000000010010 (sign magnitude, 16 bits) 
= 10010010 (sign magnitude, 8 bits) 
1000000000010010 ( 


sign magnitude, 8 bits) 


sign magnitude, 16 bits) 


This procedure will not work for twos complement negative integers. Using the 
same example, 


+18 
+18 
—18 
—32,658 


= 00010010 (twos complement, 8 bits) 
= —_0000000000010010 (twos complement, 16 bits) 
= 11101110 (twos complement, 8 bits) 
= 1000000001101110 (twos complement, 16 bits) 


The next to last line is easily seen using the value box of Figure 9.2. The last line 
can be verified using Equation (9.2) or a 16-bit value box. 


Instead, 


the rule for twos complement integers is to move the sign bit to the 


new leftmost position and fill in with copies of the sign bit. For positive numbers, 
fill in with zeros, and for negative numbers, fill in with ones. This is called sign 


extension. 


—18 
—18 


= 11101110 (twos complement, 8 bits) 
=  1111111111101110 (twos complement, 16 bits) 
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To see why this rule works, let us again consider an n-bit sequence of binary 
digits a,,_,a,_2... a, interpreted as a twos complement integer A, so that its value is 


n-2 
A= -2"1a,. + >a; 
i=0 


If A is a positive number, the rule clearly works. Now, if A is negative and we want 
to construct an m-bit representation, with m > n. Then 


m—-2 : 
A= —2" 14-4 + > 2'a; 
i=0 
The two values must be equal: 


m-2 n-2 


ee 2 aan ia, 
i=0 i=0 


m—-2 : 
—2m-1 nig > 2'a; _— —pn-1 
i=n-1 
m-2 : 
gn-l fe > 2'a; = gim-1 
i=n-1 
n-2 : m-2 : m-2 ; 
1+ S2+ } a,=1+ 32! 
i=0 i=n-1 i=0 
m-2 : m—-2 ; 
¥ 2a,= 32! 
i=n-1 i=n-1 
= Am-2 va An-2 Qn-1 1 


In going from the first to the second equation, we require that the least signif- 
icant n — 1 bits do not change between the two representations. Then we get to the 
next to last equation, which is only true if all of the bits in positions n — 1 through 
m — 2 are 1. Therefore, the sign-extension rule works. The reader may find the rule 
easier to grasp after studying the discussion on twos-complement negation at the 
beginning of Section 9.3. 


Fixed-Point Representation 


Finally, we mention that the representations discussed in this section are sometimes 
referred to as fixed point. This is because the radix point (binary point) is fixed and 
assumed to be to the right of the rightmost digit. The programmer can use the same 
representation for binary fractions by scaling the numbers so that the binary point is 
implicitly positioned at some other location. 
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This section examines common arithmetic functions on numbers in twos comple- 
ment representation. 
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Negation 


In sign-magnitude representation, the rule for forming the negation of an integer is 
simple: invert the sign bit. In twos complement notation, the negation of an integer 
can be formed with the following rules: 


1. Take the Boolean complement of each bit of the integer (including the sign 
bit). That is, set each 1 to 0 and each 0 to 1. 


2. Treating the result as an unsigned binary integer, add 1. 


This two-step process is referred to as the twos complement operation, or the taking 
of the twos complement of an integer. 


+18 = 00010010 (twos complement) 
bitwise complement = 11101101 
+ 1 


11101110 = —18 


As expected, the negative of the negative of that number is itself: 


—18 11101110 (twos complement) 
bitwise complement = 00010001 
dF 1 
00010010 = +18 


We can demonstrate the validity of the operation just described using the def- 
inition of the twos complement representation in Equation (9.2). Again, interpret an 
n-bit sequence of binary digits a,_,a,_2...a,a) as a twos complement integer A, so 
that its value is 

n-2 
A= -2" ta, + > 2'a; 
i=0 
Now form the bitwise complement, @,_;@,_2...a, and, treating this is an unsigned 
integer, add 1. Finally, interpret the resulting n-bit sequence of binary digits as a 
twos complement integer B, so that its value is 
n—-2 
B= -2" aq +1+ 2G; 
i=0 
Now, we want A = —B, which means A + B = 0. This is easily shown to be true: 


n- 


2 
—(Gy_-1 + G)2”" 1 +14 (S20 + 7) ) 


i=0 
n-2 : 
Sia 2h td +( 2 
i=0 


271414 (27) - 1) 
214 2"1=90 


A+B 
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The preceding derivation assumes that we can first treat the bitwise complement of 
A as an unsigned integer for the purpose of adding 1, and then treat the result as a 
twos complement integer. There are two special cases to consider. First, consider 
A = 0. In that case, for an 8-bit representation: 


0 = 00000000 (twos complement) 
bitwise complement = 11111111 
+ 1 


100000000 = 0 


There is carry out of the most significant bit position, which is ignored. The result is 
that the negation of 0 is 0, as it should be. 

The second special case is more of a problem. If we take the negation of the bit 
pattern of 1 followed by n — 1 zeros, we get back the same number. For example, for 
8-bit words, 


—128 10000000 (twos complement) 
bitwise complement = 01111111 
4+ 1 
10000000 = —128 


Some such anomaly is unavoidable. The number of different bit patterns in an 
n-bit word is 2”, which is an even number. We wish to represent positive and nega- 
tive integers and 0. If an equal number of positive and negative integers are repre- 
sented (sign magnitude), then there are two representations for 0. If there is only 
one representation of 0 (twos complement), then there must be an unequal number 
of negative and positive numbers represented. In the case of twos complement, for 
an n-bit length, there is a representation for —2” ' but not for +2”. 


Addition and Subtraction 


Addition in twos complement is illustrated in Figure 9.3. Addition proceeds as if the 
two numbers were unsigned integers. The first four examples illustrate successful 
operations. If the result of the operation is positive, we get a positive number in twos 
complement form, which is the same as in unsigned-integer form. If the result of the 
operation is negative, we get a negative number in twos complement form. Note 
that, in some instances, there is a carry bit beyond the end of the word (indicated by 
shading), which is ignored. 

On any addition, the result may be larger than can be held in the word size 
being used. This condition is called overflow. When overflow occurs, the ALU must 
signal this fact so that no attempt is made to use the result. To detect overflow, the 
following rule is observed: 


OVERFLOW RULE: If two numbers are added, and they are both positive or 
both negative, then overflow occurs if and only if the result has the opposite sign. 
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1001 = —7 
+0101 = 5 
1110 = —2 


(a) (~7) + (+5) 


1100 = —4 
+0100 = 4 
10000 = 0 


(b) (—4) + (+4) 


(c) (+3) + (+4) 


0101 = 5 
+0100 = 4 


1001 
(e) (+5) + (+4) 


Overflow 


1100 = 
+1111 =-1 
11011 = 


(d) (—4) + (“I 


1001 = 
+1010 = -6 
10011 = 


(f)(—7) + (—6) 


Overflow 


Figure 9.3 


Representation 


Addition of Numbers in Twos Complement 
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Figures 9.3e and f show examples of overflow. Note that overflow can occur 
whether or not there is a carry. 
Subtraction is easily handled with the following rule: 


SUBTRACTION RULE: To subtract one number (subtrahend) from another 
(minuend), take the twos complement (negation) of the subtrahend and add it to 


the minuend. 


Thus, subtraction is achieved using addition, as illustrated in Figure 9.4. The 
last two examples demonstrate that the overflow rule still applies. 


0010 = 2 0101 = 5 
+1001 = —7 +1110 = —2 
1011 = —5 10011 = 3 
(a) M = 2 = 0010 (b) M = 5 = 0101 
S = 7 = 0111 S = 2 = 0010 
—-S = 1001 —-S = 1110 
1011 = —5 0101 =5 
+1110 = -—2 +0010 = 2 
11001 = -—7 0111 = 7 
(c) M=-—-—5 = 1011 (d) M= 5 = 0101 
S = 2 = 0010 S = —2 = 1110 
—-S = 1110 —-S = 0010 
0111 = 7 1010 = -6 
+0111 = 7 +1100 = -—4 
1110 = Overflow 10110 = Overflow 
(e) M= 7 = 0111 (£) M = -—6 = 1010 
Ss = —7 = 1001 S = 4 = 0100 
-S = 0111 —-S = 1100 


Figure 9.4 Subtraction of Numbers in Twos Complement 


Representation (M — S) 
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Subtraction Addition Subtraction Addition 
of positive of positive of positive of positive 
numbers 1 0000 0001 numbers numbers numbers 


1101 


0010 


0011 


9 -8-7-6-5-4-3-2-10123456789 


0100 


(a) 4-bit numbers (b) n-bit numbers 


Figure 9.5 Geometric Depiction of Twos Complement Integers 


Some insight into twos complement addition and subtraction can be gained by 
looking at a geometric depiction [BENH92], as shown in Figure 9.5. The circle in the 
upper half of each part of the figure is formed by selecting the appropriate segment 
of the number line and joining the endpoints. Note that when the numbers are laid 
out on a circle, the twos complement of any number is horizontally opposite that 
number (indicated by dashed horizontal lines). Starting at any number on the circle, 
we can add positive k (or subtract negative k) to that number by moving k positions 
clockwise, and we can subtract positive k (or add negative k) from that number by 
moving k positions counterclockwise. If an arithmetic operation results in traversal 
of the point where the endpoints are joined, an incorrect answer is given (overflow). 


All of the examples of Figures 9.3 and 9.4 are easily traced in the circle of Figure 9.5. 


Figure 9.6 suggests the data paths and hardware elements needed to accom- 
plish addition and subtraction. The central element is a binary adder, which is pre- 
sented two numbers for addition and produces a sum and an overflow indication. 
The binary adder treats the two numbers as unsigned integers. (A logic implementa- 
tion of an adder is given in Chapter 20.) For addition, the two numbers are presented 
to the adder from two registers, designated in this case as A and B registers. The re- 
sult may be stored in one of these registers or in a third. The overflow indication is 
stored in a 1-bit overflow flag (0 = no overflow; 1 = overflow). For subtraction, the 
subtrahend (B register) is passed through a twos complementer so that its twos 
complement is presented to the adder. Note that Figure 9.6 only shows the data 
paths. Control signals are needed to control whether or not the complementer is 
used, depending on whether the operation is addition or subtraction. 


9.3 / INTEGER ARITHMETIC 317 


B Register A Register 


Complementer 


Adder 


OF = Overflow bit 
SW = Switch (select addition or subtraction) 


Figure 9.6 Block Diagram of Hardware for Addition and 
Subtraction 


Multiplication 


Compared with addition and subtraction, multiplication is a complex operation, 
whether performed in hardware or software. A wide variety of algorithms have been 
used in various computers. The purpose of this subsection is to give the reader some 
feel for the type of approach typically taken. We begin with the simpler problem of 
multiplying two unsigned (nonnegative) integers, and then we look at one of the most 
common techniques for multiplication of numbers in twos complement representation. 


UNSIGNED INTEGERS Figure 9.7 illustrates the multiplication of unsigned binary inte- 
gers, as might be carried out using paper and pencil. Several important observations 
can be made: 


1. Multiplication involves the generation of partial products, one for each digit in the 
multiplier. These partial products are then summed to produce the final product. 


2. The partial products are easily defined. When the multiplier bit is 0, the partial 
product is 0. When the multiplier is 1, the partial product is the multiplicand. 


1011 Multiplicand (11) 
X1101 Multiplier (13) 
1011 
0000 : 
1011 Partial products 
1011 
10001111 Product (143) 


Figure 9.7 Multiplication of 
Unsigned Binary Integers 


318 


CHAPTER 9 / COMPUTER ARITHMETIC 


3. The total product is produced by summing the partial products. For this opera- 
tion, each successive partial product is shifted one position to the left relative to 
the preceding partial product. 


4. The multiplication of two n-bit binary integers results in a product of up to 2” 
bits in length (e.g.,11 x 11 = 1001). 


Compared with the pencil-and-paper approach, there are several things we can 
do to make computerized multiplication more efficient. First, we can perform a run- 
ning addition on the partial products rather than waiting until the end. This eliminates 
the need for storage of all the partial products; fewer registers are needed. Second, we 
can save some time on the generation of partial products. For each 1 on the multiplier, 
an add and a shift operation are required; but for each 0, only a shift is required. 

Figure 9.8a shows a possible implementation employing these measures. The 
multiplier and multiplicand are loaded into two registers (Q and M). A third register, 


Multiplicand 
M,-1 oe. Mo 


Add Shift and add 


n-bit adder A 
control logic 


Shift right 
AL 
bmi Ag ——Q,,-1 ey Q 
ae Multiplier 
(a) Block diagram 
Cc A Q M 
0 0000 1101 1011 Initial values 


@ — @alal 1101 1011 Add t First 


O- “CLOT 1110 1011 Shite cycle 
F Second 
0 0010 1111 1011 Shift t cycle 


Oo 1101 1111 1011 Add Third 
0 0110 1111 1011 Shift cycle 


0001 1111 1011 Add t Fourth 
1000 1111 1011 Shift cycle 


(b) Example from Figure 9.7 (product in A, Q) 


Figure 9.8 Hardware Implementation of Unsigned Binary Multiplication 
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START 


C,A<c0 

M < Multiplicand 
Q < Multiplier 
Count <n 


Shift right C, A, Q 
Count < Count - 1 


Product 
in A,Q 


Figure 9.9 Flowchart for Unsigned Binary Multiplication 


the A register, is also needed and is initially set to 0. There is also a 1-bit C register, 
initialized to 0, which holds a potential carry bit resulting from addition. 

The operation of the multiplier is as follows. Control logic reads the bits of the 
multiplier one at a time. If Qy is 1, then the multiplicand is added to the A register 
and the result is stored in the A register, with the C bit used for overflow. Then all of 
the bits of the C, A, and Q registers are shifted to the right one bit, so that the C bit 
goes into A,,_;, Ag goes into Q,,_;, and Qy is lost. If Qo is 0, then no addition is per- 
formed, just the shift. This process is repeated for each bit of the original multiplier. 
The resulting 2n-bit product is contained in the A and Q registers. A flowchart of the 
operation is shown in Figure 9.9, and an example is given in Figure 9.8b. Note that 
on the second cycle, when the multiplier bit is 0, there is no add operation. 


TWOS COMPLEMENT MULTIPLICATION We have seen that addition and subtrac- 
tion can be performed on numbers in twos complement notation by treating them as 
unsigned integers. Consider 


1001 
+0011 
1100 


If these numbers are considered to be unsigned integers, then we are adding 9 (1001) 
plus 3 (0011) to get 12 (1100). As twos complement integers, we are adding —7 (1001) 
to 3 (0011) to get —4 (1100). 
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1011 
x 1101 

00001011 1011 x 1 x 2° 

00000000 1011 x 0 x 2? 

00101100 1011 x 1 x 2? 

01011000 1011 x 1 x 2? 


10001111 


Figure 9.10 Multiplication of Two 
Unsigned 4-Bit Integers Yielding an 
8-Bit Result 


Unfortunately, this simple scheme will not work for multiplication. To see this, 
consider again Figure 9.7. We multiplied 11 (1011) by 13 (1101) to get 143 (10001111). 
If we interpret these as twos complement numbers, we have —5 (1011) times 
—3 (1101) equals —113 (10001111). This example demonstrates that straightforward 
multiplication will not work if both the multiplicand and multiplier are negative. In 
fact, it will not work if either the multiplicand or the multiplier is negative. To justify 
this statement, we need to go back to Figure 9.7 and explain what is being done in 
terms of operations with powers of 2. Recall that any unsigned binary number can 
be expressed as a sum of powers of 2. Thus, 


1101 =1x2?>4+1x2?4+0x2!4+1-x 2° 
=2+2?+2° 


Further, the multiplication of a binary number by 2” is accomplished by shift- 
ing that number to the left n bits. With this in mind, Figure 9.10 recasts Figure 9.7 to 
make the generation of partial products by multiplication explicit. The only differ- 
ence in Figure 9.10 is that it recognizes that the partial products should be viewed as 
2n-bit numbers generated from the n-bit multiplicand. 

Thus, as an unsigned integer, the 4-bit multiplicand 1011 is stored in an 8-bit 
word as 00001011. Each partial product (other than that for 2°) consists of this num- 
ber shifted to the left, with the unoccupied positions on the right filled with zeros 
(e.g., a shift to the left of two places yields 00101100). 

Now we can demonstrate that straightforward multiplication will not work if 
the multiplicand is negative. The problem is that each contribution of the negative 
multiplicand as a partial product must be a negative number on a 2n-bit field; the sign 
bits of the partial products must line up. This is demonstrated in Figure 9.11, which 
shows that multiplication of 1001 by 0011. If these are treated as unsigned integers, 
the multiplication of 9 x 3 = 27 proceeds simply. However, if 1001 is interpreted as 


1001 (9) 1001 (-7) 

x 0011 (3) x 0011 (3) 
00001001 1001 x 2° 11111001 (—7) x 2° = (-7) 
00010010 1001 x 23 11110010 (—7) x 21 = (-14) 
00011011 (27) 11101011 (—21) 


(a) Unsigned integers (b) Twos complement integers 


Figure 9.11 Comparison of Multiplication of Unsigned and Twos 
Complement Integers 


9.3 / INTEGER ARITHMETIC 321 


the twos complement value —7, then each partial product must be a negative twos 
complement number of 2n (8) bits, as shown in Figure 9.11b. Note that this is accom- 
plished by padding out each partial product to the left with binary 1s. 

If the multiplier is negative, straightforward multiplication also will not work. 
The reason is that the bits of the multiplier no longer correspond to the shifts or 
multiplications that must take place. For example, the 4-bit decimal number —3 is 
written 1101 in twos complement. If we simply took partial products based on each 
bit position, we would have the following correspondence: 


1101 <> -(1 x 24+1x*2?+0x 2! 41 x 2% = -(2? + 2? + 2%) 


In fact, what is desired is —(2' + 2°). So this multiplier cannot be used directly in the 
manner we have been describing. 

There are a number of ways out of this dilemma. One would be to convert 
both multiplier and multiplicand to positive numbers, perform the multiplication, 
and then take the twos complement of the result if and only if the sign of the two 
original numbers differed. Implementers have preferred to use techniques that do 
not require this final transformation step. One of the most common of these is 
Booth’s algorithm. This algorithm also has the benefit of speeding up the multiplica- 
tion process, relative to a more straightforward approach. 

Booth’s algorithm is depicted in Figure 9.12 and can be described as follows. 
As before, the multiplier and multiplicand are placed in the Q and M registers, 


START 


A€0,Q_,< 0 
M < Multiplicand 
Q< Multiplier 
Count —n 


AcA+M 


Arithmetic shift 
Right: A, Q, Q-1 
Count <— Count — 1 


Figure 9.12 Booth’s Algorithm for Twos 
Complement Multiplication 
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A Q O41 M 

0000 0011 0 0111 Initial values 

1001 0011 0 0111 AS] A a First 

1100 1001 1 0111 Shift cycle 
: t Second 

1110 0100 1 0111 Shift cycle 

OOH 0100 0111 AcAFt n Third 

0010 1010 0111 Shift cycle 
; t Fourth 

0001 0101 0 0111 Shift cycle 


Figure 9.13 Example of Booth’s Algorithm (7 X 3) 


respectively. There is also a 1-bit register placed logically to the right of the least sig- 
nificant bit (Qy) of the Q register and designated Q_,; its use is explained shortly. 
The results of the multiplication will appear in the A and Q registers. A and Q_,; are 
initialized to 0. As before, control logic scans the bits of the multiplier one at a time. 
Now, as each bit is examined, the bit to its right is also examined. If the two bits are 
the same (1-1 or 0-0), then all of the bits of the A, Q, and Q_, registers are shifted to 
the right 1 bit. If the two bits differ, then the multiplicand is added to or subtracted 
from the A register, depending on whether the two bits are 0-1 or 1-0. Following the 
addition or subtraction, the right shift occurs. In either case, the right shift is such 
that the leftmost bit of A, namely A,,_;, not only is shifted into A,,_», but also re- 
mains in A,,_;. This is required to preserve the sign of the number in A and Q. It is 
known as an arithmetic shift, because it preserves the sign bit. 

Figure 9.13 shows the sequence of events in Booth’s algorithm for the multipli- 
cation of 7 by 3. More compactly, the same operation is depicted in Figure 9.14a. The 
rest of Figure 9.14 gives other examples of the algorithm. As can be seen, it works 
with any combination of positive and negative numbers. Note also the efficiency of 
the algorithm. Blocks of 1s or Os are skipped over, with an average of only one addi- 
tion or subtraction per block. 


0111 0111 
x 0011 (0) x 1101 (0) 
11111001 = 11111001 1-0 
0000000 <a 0000111 0-1 
000111 0-1 111001 1-0 
00010101 (21) 11101011 (-21) 


(a) (7) X (3) = 21) 


(b) (7) X (~3) = (21) 


1001 1001 
x 0011 (0) x 1101 (0) 
00000111 1-0 00000111 1-0 
0000000 1-1 1111001 0-1 
111001 0-1 000111 1-0 
11101011 (—21) 00010101 (21) 
(c) (7) X (3) = (-21)) (d) (~7) X (—3) = (21) 


Figure 9.14 Examples Using Booth’s Algorithm 
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Why does Booth’s algorithm work? Consider first the case of a positive multi- 
plier. In particular, consider a positive multiplier consisting of one block of 1s sur- 
rounded by Os (for example, 00011110). As we know, multiplication can be achieved 
by adding appropriately shifted copies of the multiplicand: 


M X (00011110) = M X (24 + 23 + 2? + 2") 
=M xX (16+ 8+ 44 2) 
=M x 30 


The number of such operations can be reduced to two if we observe that 


ie Ae a (9.3) 


M x (00011110) = M x (25 — 2!) 
= M X (32 - 2) 
= M x 30 


So the product can be generated by one addition and one subtraction of the multi- 
plicand. This scheme extends to any number of blocks of 1s in a multiplier, includ- 
ing the case in which a single 1 is treated as a block. 


M xX (01111010) = M x (2® + 2° + 24 + 23 + 2!) 
=M x (27 — 23 + 2? — 2') 
Booth’s algorithm conforms to this scheme by performing a subtraction when the 
first 1 of the block is encountered (1-0) and an addition when the end of the block 
is encountered (0-1). 


To show that the same scheme works for a negative multiplier, we need to ob- 
serve the following. Let X be a negative number in twos complement notation: 


Representation of X = {1x,_2%,-3...X1Xo} 
Then the value of X can be expressed as follows: 
X = —20 1+ (xy 2 X 27) + (Xp-3 X WS) + + (xy XK 24 + (x X 29) GA) 


The reader can verify this by applying the algorithm to the numbers in Table 9.2. 
The leftmost bit of X is 1, because X is negative. Assume that the leftmost 0 is 
in the kth position. Thus, X is of the form 


Representation of X = {111...10x,_,x,_2...x,xX0} (9.5) 
Then the value of X is 
X= —2P 4 272 4-2 + DE 4 (xy X DEN) + + (xy X 2°.) 
From Equation (9.3), we can say that 


gu-2 ote Qn-3 hays gk+l = gn-1 _ gktl 
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Rearranging 
—2n-t 4 gr? 4 gms 4 ee + DEH = Okt (9.7) 
Substituting Equation (9.7) into Equation (9.6), we have 
X= —21 + (x1 X 2AM) +--+ + (xq X 2% (9.8) 


At last we can return to Booth’s algorithm. Remembering the representation 
of X [Equation (9.5)], it is clear that all of the bits from x9 up to the leftmost 0 are 
handled properly because they produce all of the terms in Equation (9.8) but 
(—2**!) and thus are in the proper form. As the algorithm scans over the leftmost 0 
and encounters the next 1 (2*'), a 1-0 transition occurs and a subtraction takes 
place (—2‘*'). This is the remaining term in Equation (9.8). 


As an example, consider the multiplication of some multiplicand by (—6). In 
twos complement representation, using an 8-bit word, (—6) is represented as 
11111010. By Equation (9.4), we know that 


ee ee es 
which the reader can easily verify. Thus, 
M X (11111010) = M x (—27 + 2° + 2° + 244+ 23 + 2!) 
Using Equation (9.7), 
M X (11111010) = M x (—23 + 2!) 


which the reader can verify is still M X (—6). Finally, following our earlier line 
of reasoning, 


M X (11111010) = M x (—2? + 2? — 2!) 


We can see that Booth’s algorithm conforms to this scheme. It performs a sub- 
traction when the first 1 is encountered (1-0), an addition when (01) is encountered, 
and finally another subtraction when the first 1 of the next block of 1s is encoun- 
tered. Thus, Booth’s algorithm performs fewer additions and subtractions than a 
more straightforward algorithm. 

Division 

Division is somewhat more complex than multiplication but is based on the same 
general principles. As before, the basis for the algorithm is the paper-and-pencil 
approach, and the operation involves repetitive shifting and addition or subtraction. 

Figure 9.15 shows an example of the long division of unsigned binary integers. 
It is instructive to describe the process in detail. First, the bits of the dividend are ex- 
amined from left to right, until the set of bits examined represents a number greater 
than or equal to the divisor; this is referred to as the divisor being able to divide the 
number. Until this event occurs, Os are placed in the quotient from left to right. When 
the event occurs, a 1 is placed in the quotient and the divisor is subtracted from the 
partial dividend. The result is referred to as a partial remainder. From this point on, 
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00001101 ~—— Quotient 
Divisor ——» 1011/10010011 ~—— Dividend 


1011} 

001110 

a 
Partial Crk 


remainders ae 
100 ~—— Remainder 


Figure 9.15 Example of Division of Unsigned 
Binary Integers 


the division follows a cyclic pattern. At each cycle, additional bits from the dividend 
are appended to the partial remainder until the result is greater than or equal to the 
divisor. As before, the divisor is subtracted from this number to produce a new par- 
tial remainder. The process continues until all the bits of the dividend are exhausted. 

Figure 9.16 shows a machine algorithm that corresponds to the long division 
process. The divisor is placed in the M register, the dividend in the Q register. At each 


START 


Ac0 

M < Divisor 
Q< Dividend 
Count <n 


Shift left 
A,Q 


Q << 0 
Ot ACA+M 


Count < Count — 1 


Quotient in Q 
Remainder in A 


Figure 9.16 Flowchart for Unsigned Binary Division 
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A Q 
0000 0111 Initial value 
0000 1110 Shift 
1101 Use twos complement of 0011 for subtraction 
1101 Subtract 
0000 1110 Restore, set Qy = 0 
0001 1100 Shift 
LOL 
1110 Subtract 
0001 1100 Restore, set Qo = 0 
0011 1000 Shift 
1101 
0000 1001 Subtract, set Qo = 1 
0001 0010 Shift 
1 aes 
1110 Subtract 
0001 0010 Restore, set Qo = 0 


Figure 9.17 Example of Restoring Twos Complement Division (7/3) 


step, the A and Q registers together are shifted to the left 1 bit. M is subtracted from 
A to determine whether A divides the partial remainder. If it does, then Qy gets a 
1 bit. Otherwise, Qy gets a 0 bit and M must be added back to A to restore the previ- 
ous value. The count is then decremented, and the process continues for n steps. At 
the end, the quotient is in the Q register and the remainder is in the A register. 

This process can, with some difficulty, be extended to negative numbers. We 
give here one approach for twos complement numbers. An example of this ap- 
proach is shown in Figure 9.17. 

The algorithm assumes that the divisor V and the dividend D are positive and 
that |V| < |D|. If |V| = |D], then the quotient Q = 1 and the remainder R = 0. If 
|V| > |D|, then Q = Oand R = D. The algorithm can be summarized as follows: 


1. Load the twos complement of the divisor into the M register; that is, the M reg- 
ister contains the negative of the divisor. Load the dividend into the A, Q reg- 
isters. The dividend must be expressed as a 2n-bit positive number. Thus, for 
example, the 4-bit 0111 becomes 00000111. 


2. Shift A, Q left 1 bit position. 
3. Perform A < A — M. This operation subtracts the divisor from the contents of A. 


4. a. If the result is nonnegative (most significant bit of A = 0), then set Q) <1. 
b. If the result is negative (most significant bit of A = 1), then set Qy <— 0 and 
restore the previous value of A. 
5. Repeat steps 2 through 4 as many times as there are bit positions in Q. 
6. The remainder is in A and the quotient is in Q. 


3This is subtraction of unsigned integers. A result that requires a borrow out of the most significant bit is 
a negative result. 
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To deal with negative numbers, we recognize that the remainder is defined by 
This is because the remainder is defined by 


D=OXV+R 


Consider the following examples of integer division with all possible combina- 
tions of signs of D and V: 


D=7 V=3 > Q=2 R#=1 
D=7 V=-3 > Q=-2 R=1 
D=-7 V=3 => Q=-2 R=-1 
D=-7 V=-3 > QO=2 R=-1 


The reader will note from Figure 9.17 that (—7)/(3) and (7)/(—3) produce dif- 
ferent remainders. We see that the magnitudes of Q and R are unaffected by the 
input signs and that the signs of Q and R are easily derivable form the signs of D and V. 
Specifically, sign(R) = sign(D) and sign(Q) = sign(D) X sign(V). Hence, one way 
to do twos complement division is to convert the operands into unsigned values and, 
at the end, to account for the signs by complementation where needed. This is the 
method of choice for the restoring division algorithm [PARHO0]. 


9.4 FLOATING-POINT REPRESENTATION 


Principles 


With a fixed-point notation (e.g., twos complement) it is possible to represent a 
range of positive and negative integers centered on 0. By assuming a fixed binary or 
radix point, this format allows the representation of numbers with a fractional com- 
ponent as well. 

This approach has limitations. Very large numbers cannot be represented, nor 
can very small fractions. Furthermore, the fractional part of the quotient in a divi- 
sion of two large numbers could be lost. 

For decimal numbers, we get around this limitation by using scientific notation. 
Thus, 976,000,000,000,000 can be represented as 9.76 < 104, and 0.0000000000000976 
can be represented as 9.76 X 10°'4. What we have done, in effect, is dynamically to 
slide the decimal point to a convenient location and use the exponent of 10 to keep 
track of that decimal point. This allows a range of very large and very small numbers 
to be represented with only a few digits. 

This same approach can be taken with binary numbers. We can represent a 
number in the form 


+S x B* 
This number can be stored in a binary word with three fields: 


e Sign: plus or minus 
e Significand S 
e Exponent E 
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Sign of 
significand ; a 
8 bits >~< 23 bits > 
[| Biased exponent Significand 
(a) Format 

1.1010001 x 279199 = Q 10010011 10100010000000000000000 = 1.6328125 x 27° 
—1.1010001 x 219199 = 1 10010011 10100010000000000000000 = —1.6328125 x 27° 
1.1010001 x 2719199 = 9 01101011 10100010000000000000000 = 1.6328125 x 277° 
—1.1010001 x 271919 = 1 01101011 10100010000000000000000 = —1.6328125 x 277° 


(b) Examples 
Figure 9.18 Typical 32-Bit Floating-Point Format 


The base B is implicit and need not be stored because it is the same for all numbers. 
Typically, it is assumed that the radix point is to the right of the leftmost, or most sig- 
nificant, bit of the significand. That is, there is one bit to the left of the radix point. 

The principles used in representing binary floating-point numbers are best ex- 
plained with an example. Figure 9.18a shows a typical 32-bit floating-point format. 
The leftmost bit stores the sign of the number (0 = positive, 1 = negative). The 
exponent value is stored in the next 8 bits. The representation used is known as a 
biased representation. A fixed value, called the bias, is subtracted from the field to 
get the true exponent value. Typically, the bias equals (2*-' — 1), where k is the 
number of bits in the binary exponent. In this case, the 8-bit field yields the numbers 
0 through 255. With a bias of 127 (2’ — 1), the true exponent values are in the range 
—127 to +128. In this example, the base is assumed to be 2. 

Table 9.2 shows the biased representation for 4-bit integers. Note that when 
the bits of a biased representation are treated as unsigned integers, the relative mag- 
nitudes of the numbers do not change. For example, in both biased and unsigned 
representations, the largest number is 1111 and the smallest number is 0000. This is 
not true of sign-magnitude or twos complement representation. An advantage of 
biased representation is that nonnegative floating-point numbers can be treated as 
integers for comparison purposes. 

The final portion of the word (23 bits in this case) is the significand.* 

Any floating-point number can be expressed in many ways. 


The following are equivalent, where the significand is expressed in binary form: 


0.110 x 2° 
110 x 2? 
0.0110 x 2° 


To simplify operations on floating-point numbers, it is typically required that they 
be normalized. A normalized number is one in which the most significant digit of 


“The term mantissa, sometimes used instead of significand, is considered obsolete. Mantissa also means 
“the fractional part of a logarithm,” so is best avoided in this context. 
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the significand is nonzero. For base 2 representation, a normalized number is there- 
fore one in which the most significant bit of the significand is one. As was men- 
tioned, the typical convention is that there is one bit to the left of the radix point. 
Thus, a normalized nonzero number is one in the form 


+1.bbb...b x 2*F 


where b is either binary digit (0 or 1). Because the most significant bit is always one, 
it is unnecessary to store this bit; rather, it is implicit. Thus, the 23-bit field is used to 
store a 24-bit significand with a value in the half open interval [1, 2). Given a num- 
ber that is not normalized, the number may be normalized by shifting the radix 
point to the right of the leftmost 1 bit and adjusting the exponent accordingly. 

Figure 9.18b gives some examples of numbers stored in this format. For each 
example, on the left is the binary number; in the center is the corresponding bit pat- 
tern; on the right is the decimal value. Note the following features: 


e The sign is stored in the first bit of the word. 


e The first bit of the true significand is always 1 and need not be stored in the 
significand field. 


e The value 127 is added to the true exponent to be stored in the exponent field. 
e The base is 2. 

For comparison, Figure 9.19 indicates the range of numbers that can be repre- 
sented in a 32-bit word. Using twos complement integer representation, all of the in- 
tegers from —2*! to 27! — 1 can be represented, for a total of 2” different numbers. 
With the example floating-point format of Figure 9.18, the following ranges of num- 
bers are possible: 

e Negative numbers between —(2 — 2°) x 2)8 and —27177 
e Positive numbers between 2°'”’ and (2 — 24) x 28 


Expressible integers 


> 


Number 


_731 0 Bi_4 line 


(a) Twos complement integers 


Negative Positive 
underflow underflow 
Negative Expressible negative Expressible positive Positive 
overflow numbers x J numbers overflow 
ae een Zero 
Number 
=(2 — 27%) x 228 27127 0 2-127 (2 — 273) x 228 line 


(b) Floating-point numbers 


Figure 9.19 Expressible Numbers in Typical 32-Bit Formats 
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Five regions on the number line are not included in these ranges: 


e Negative numbers less than —(2 — 2°) x 2'8, called negative overflow 


2-27 


e Negative numbers greater than , called negative underflow 


e Zero 


9-227 


e Positive numbers less than , called positive underflow 


e Positive numbers greater than (2 — 2°**) x 28 called positive overflow 


The representation as presented will not accommodate a value of 0. However, as 
we shall see, actual floating-point representations include a special bit pattern to des- 
ignate zero. Overflow occurs when an arithmetic operation results in a magnitude 
greater than can be expressed with an exponent of 128 (e.g.,2'7° x 2! = 2°). Un- 
derflow occurs when the fractional magnitude is too small (e.g.,2-'7° x 271° = 2°), 
Underflow is a less serious problem because the result can generally be satisfactorily 
approximated by 0. 

It is important to note that we are not representing more individual values 
with floating-point notation. The maximum number of different values that can be 
represented with 32 bits is still 2°. What we have done is to spread those numbers 
out in two ranges, one positive and one negative. In practice, most floating-point 
numbers that one would wish to represent are represented only approximately. 
However, for moderate sized integers, the representation is exact. 

Also, note that the numbers represented in floating-point notation are not 
spaced evenly along the number line, as are fixed-point numbers. The possible val- 
ues get closer together near the origin and farther apart as you move away, as shown 
in Figure 9.20. This is one of the trade-offs of floating-point math: Many calculations 
produce results that are not exact and have to be rounded to the nearest value that 
the notation can represent. 

In the type of format depicted in Figure 9.18, there is a trade-off between 
range and precision. The example shows 8 bits devoted to the exponent and 23 to 
the significand. If we increase the number of bits in the exponent, we expand the 
range of expressible numbers. But because only a fixed number of different values 
can be expressed, we have reduced the density of those numbers and therefore the 
precision. The only way to increase both range and precision is to use more bits. 
Thus, most computers offer, at least, single-precision numbers and double-precision 
numbers. For example, a single-precision format might be 32 bits, and a double- 
precision format 64 bits. 

So there is a trade-off between the number of bits in the exponent and the 
number of bits in the significand. But it is even more complicated than that. The 
implied base of the exponent need not be 2. The IBM S/390 architecture, for exam- 
ple, uses a base of 16 [ANDE67b]. The format consists of a 7-bit exponent and a 
24-bit significand. 


—n 0 n 2n 4n 


Figure 9.20 Density of Floating-Point Numbers 
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In the IBM base-16 format, 


0.11010001 x 21° = 0.11010001 x 161 


and the exponent is stored to represent 5 rather than 20. 


The advantage of using a larger exponent is that a greater range can be 
achieved for the same number of exponent bits. But remember, we have not 
increased the number of different values that can be represented. Thus, for a 
fixed format, a larger exponent base gives a greater range at the expense of less 
precision. 


IEEE Standard for Binary Floating-Point Representation 


The most important floating-point representation is defined in IEEE Standard 
754, adopted in 1985. This standard was developed to facilitate the portability of 
programs from one processor to another and to encourage the development of 
sophisticated, numerically oriented programs. The standard has been widely 
adopted and is used on virtually all contemporary processors and arithmetic 
coprocessors. 

The IEEE standard defines both a 32-bit single and a 64-bit double format 
(Figure 9.21), with 8-bit and 11-bit exponents, respectively. The implied base is 2. In 
addition, the standard defines two extended formats, single and double, whose exact 
format is implementation dependent. The extended formats include additional bits 
in the exponent (extended range) and in the significand (extended precision). The 
extended formats are to be used for intermediate calculations. With their greater 
precision, the extended formats lessen the chance of a final result that has been con- 
taminated by excessive roundoff error; with their greater range, they also lessen the 
chance of an intermediate overflow aborting a computation whose final result 
would have been representable in a basic format. An additional motivation for the 
single extended format is that it affords some of the benefits of a double format 
without incurring the time penalty usually associated with higher precision. Table 9.3 
summarizes the characteristics of the four formats. 


<8 Bits—>~< 23 Bits > 


bit 
‘Ny Biased 


exponent 


Fraction 


(a) Single format 


Sign ~<— 11 Bits—~< 52 Bits > 


Biased 
exponent 


Fraction 


(b) Double format 
Figure 9.21 TEEE 754 Formats 


332 CHAPTER 9 / COMPUTER ARITHMETIC 


Table 9.3. JTEEE 754 Format Parameters 


Format 
Parameter Single Extended Double Double Extended 


Word width (bits) 
Exponent width (bits) 


Exponent bias unspecified unspecified 
Maximum exponent 127 =1023 1023 = 16383 
Minimum exponent —126 =—1022 —1022 =—16382 
Number range (base 10) iO, Noy unspecified Ome banl Ouse’ unspecified 
Significand width (bits)* 23 231 52 =63 


Number of exponents 254 unspecified unspecified 


Number of fractions 7) unspecified unspecified 


Number of values 1.98 x 22! unspecified unspecified 


*not including implied bit 


Not all bit patterns in the IEEE formats are interpreted in the usual way; in- 
stead, some bit patterns are used to represent special values. Table 9.4 indicates the 
values assigned to various bit patterns. The extreme exponent values of all zeros (0) 
and all ones (255 in single format, 2047 in double format) define special values. The 
following classes of numbers are represented: 


e For exponent values in the range of 1 through 254 for single format and 1 
through 2046 for double format, normalized nonzero floating-point numbers 
are represented. The exponent is biased, so that the range of exponents is 
—126 through +127 for single format and —1022 through +1023. A normal- 
ized number requires a 1 bit to the left of the binary point; this bit is implied, 
giving an effective 24-bit or 53-bit significand (called fraction in the standard). 


e An exponent of zero together with a fraction of zero represents positive or 
negative zero, depending on the sign bit. As was mentioned, it is useful to have 
an exact value of 0 represented. 


e An exponent of all ones together with a fraction of zero represents positive or 
negative infinity, depending on the sign bit. It is also useful to have a represen- 
tation of infinity. This leaves it up to the user to decide whether to treat over- 
flow as an error condition or to carry the value co and proceed with whatever 
program is being executed. 


e An exponent of zero together with a nonzero fraction represents a denormal- 
ized number. In this case, the bit to the left of the binary point is zero and the 
true exponent is —126 or —1022. The number is positive or negative depend- 
ing on the sign bit. 


e Anexponent of all ones together with a nonzero fraction is given the value NaN, 
which means Not a Number, and is used to signal various exception conditions. 


The significance of denormalized numbers and NaNs is discussed in Section 9.5. 


coe 


Interpretation of IEEE 754 Floating-Point Numbers 


Single Precision (32 bits) 


Double Precision (64 bits) 


Biased 
exponent 


Fraction 


Biased 
exponent 


Fraction 


positive zero 


0 


0 


0 


0 


negative zero 


0 


0 


0 


0 


plus infinity 


255 (all 1s) 


2047 (all 1s) 


minus infinity 


255 (all 1s) 


2047 (all 1s) 


quiet NaN 


255 (all 1s) 


2047 (all 1s) 


signaling NaN 


255 (all 1s) 


2047 (all 1s) 


positive 
normalized 
nonzero 


0<e < 255 


Dome l(let) 


0<e< 2047 


es p23 (let) 


negative 
normalized 
nonzero 


positive 
denormalized 


0<e < 255 


=2°-P1(1.£) 


2°-176(0.f) 


0<e< 2047 


ae (1) 


ee 1022((0.f) 


negative 
denormalized 


2° 6(0,f) 


=D (Oi) 
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9.5 FLOATING-POINT ARITHMETIC 


Table 9.5 summarizes the basic operations for floating-point arithmetic. For addi- 
tion and subtraction, it is necessary to ensure that both operands have the same ex- 
ponent value. This may require shifting the radix point on one of the operands to 
achieve alignment. Multiplication and division are more straightforward. 

A floating-point operation may produce one of these conditions: 


e Exponent overflow: A positive exponent exceeds the maximum possible expo- 
nent value. In some systems, this may be designated as +00 or —o0, 


e Exponent underflow: A negative exponent is less than the minimum possible 
exponent value (e.g., —200 is less than —127). This means that the number is 
too small to be represented, and it may be reported as 0. 


e Significand underflow: In the process of aligning significands, digits may flow 
off the right end of the significand. As we shall discuss, some form of rounding 
is required. 


e Significand overflow: The addition of two significands of the same sign may re- 
sult in a carry out of the most significant bit. This can be fixed by realignment, 
as we shall explain. 


Addition and Subtraction 


In floating-point arithmetic, addition and subtraction are more complex than multi- 
plication and division. This is because of the need for alignment. There are four basic 
phases of the algorithm for addition and subtraction: 

1. Check for zeros. 

2. Align the significands. 

3. Add or subtract the significands. 

4. Normalize the result. 


Table 9.5 Floating-Point Numbers and Arithmetic Operations 


Floating Point Numbers Arithmetic Operations 


BES Me PS BE AR NY SOG Me POO tb VA) Se EEG 
VY = Vo K Be X —Y = (Xz X B**-Y — Y,) x BY* 


be, <Y, 


BOS = (OK VQ) Oe 


= (F) SK BrrYe 
Ys 


Examples: 


X = 03 x 10° = 30 
Y = 0.2 x 10° = 200 


X+Y = (03 x 10-3 + 0.2) x 10° = 0.23 x 10° = 230 

X — Y = (0.3 x 10°73 — 0.2) x 10° = (—0.17) x 10° = -170 
X XY = (0.3 X 0.2) x 10?*3 = 0.06 x 10° = 6000 

X +Y = (03 + 0.2) x 10°73 =1.5 x 1071 = 0.15 
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A typical flowchart is shown in Figure 9.22. A step-by-step narrative highlights 
the main functions required for floating-point addition and subtraction. We assume 
a format similar to those of Figure 9.21. For the addition or subtraction operation, 
the two operands must be transferred to registers that will be used by the ALU. If 
the floating-point format includes an implicit significand bit, that bit must be made 
explicit for the operation. 


Phase 1: Zero check. Because addition and subtraction are identical except for a 
sign change, the process begins by changing the sign of the subtrahend if it is a 
subtract operation. Next, if either operand is 0, the other is reported as the result. 


Phase 2: Significand alignment. The next phase is to manipulate the numbers 
so that the two exponents are equal. 


To see the need for aligning exponents, consider the following decimal addition: 


(123 x 10°) + (456 x 10°’) 


Clearly, we cannot just add the significands. The digits must first be set into equiva- 
lent positions, that is, the 4 of the second number must be aligned with the 3 of the 
first. Under these conditions, the two exponents will be equal, which is the mathe- 
matical condition under which two numbers in this form can be added. Thus, 


(123 x 10°) + (456 x 102) = (123 x 10°) + (4.56 X 10°) = 127.56 x 10° 


Alignment may be achieved by shifting either the smaller number to the 
right (increasing its exponent) or shifting the larger number to the left. Because 
either operation may result in the loss of digits, it is the smaller number that is 
shifted; any digits that are lost are therefore of relatively small significance. The 
alignment is achieved by repeatedly shifting the magnitude portion of the signifi- 
cand right 1 digit and incrementing the exponent until the two exponents are 
equal. (Note that if the implied base is 16, a shift of 1 digit is a shift of 4 bits.) If this 
process results in a 0 value for the significand, then the other number is reported 
as the result. Thus, if two numbers have exponents that differ significantly, the less- 
er number is lost. 


Phase 3: Addition. Next, the two significands are added together, taking into 
account their signs. Because the signs may differ, the result may be 0. There is 
also the possibility of significand overflow by 1 digit. If so, the significand of 
the result is shifted right and the exponent is incremented. An exponent over- 
flow could occur as a result; this would be reported and the operation halted. 


Phase 4: Normalization. The final phase normalizes the result. Normalization 
consists of shifting significand digits left until the most significant digit (bit, or 
4 bits for base-16 exponent) is nonzero. Each shift causes a decrement of the 
exponent and thus could cause an exponent underflow. Finally, the result must 
be rounded off and then reported. We defer a discussion of rounding until 
after a discussion of multiplication and division. 


9C€ 
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Change 
sign of Y 


RETURN 


Figure 9.22 Floating-Point Addition and Subtraction (Z<— Z + Y) 
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MULTIPLY 


Add 
exponents 


! 


Subtract bias 


RETURN 


Report 
overflow 
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overflow? 
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! 


Normalize 


! 


Round 


Figure 9.23 Floating-Point Multiplication (Z< X X Y) 


Multiplication and Division 


Floating-point multiplication and division are much simpler processes than addition 
and subtraction, as the following discussion indicates. 

We first consider multiplication, illustrated in Figure 9.23. First, if either 
operand is 0, 0 is reported as the result. The next step is to add the exponents. If the 
exponents are stored in biased form, the exponent sum would have doubled the 
bias. Thus, the bias value must be subtracted from the sum. The result could be either 
an exponent overflow or underflow, which would be reported, ending the algorithm. 

If the exponent of the product is within the proper range, the next step is to 
multiply the significands, taking into account their signs. The multiplication is per- 
formed in the same way as for integers. In this case, we are dealing with a sign— 
magnitude representation, but the details are similar to those for twos complement 
representation. The product will be double the length of the multiplier and multipli- 
cand. The extra bits will be lost during rounding. 
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Figure 9.24 Floating-Point Division (Z <— X/Y) 


After the product is calculated, the result is then normalized and rounded, as 
was done for addition and subtraction. Note that normalization could result in ex- 
ponent underflow. 

Finally, let us consider the flowchart for division depicted in Figure 9.24. 
Again, the first step is testing for 0. If the divisor is 0, an error report is issued, or the 
result is set to infinity, depending on the implementation. A dividend of 0 results 
in 0. Next, the divisor exponent is subtracted from the dividend exponent. This 
removes the bias, which must be added back in. Tests are then made for exponent 
underflow or overflow. 

The next step is to divide the significands. This is followed with the usual nor- 
malization and rounding. 


Precision Considerations 


GUARD BITS We mentioned that, prior to a floating-point operation, the exponent 
and significand of each operand are loaded into ALU registers. In the case of the 
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= 1,0OOscccs OO « 2 % = ,10O000 « 16" 
ie Osis es il x 2 -y = .OFFFFF x 16* 
B= O,000c ccc On « 28 z = ,000001 x 16 
= 1. O000ccces OO « a = ,100000 « 16° 
(a) Binary example, without guard bits (c) Hexadecimal example, without guard bits 
x = iL, COOssc66 00 0000 x 2+ x = .100000 00 x 16+ 
7 S Osi. eee ii 1000 « 2 -y = .OFFFFF FO x 16+ 
B= O000s6c06 O00 10O0 « 2 z = ,.000000 10 x 16+ 
= 15 000s o4 54 00 0000 x 2° = 1OCOO 08 x 16~ 
(b) Binary example, with guard bits (d) Hexadecimal example, with guard bits 


Figure 9.25 The Use of Guard Bits 


significand, the length of the register is almost always greater than the length of the 
significand plus an implied bit. The register contains additional bits, called guard 
bits, which are used to pad out the right end of the significand with Os. 


The reason for the use of guard bits is illustrated in Figure 9.25. Consider num- 
bers in the IEEE format, which has a 24-bit significand, including an implied 
1 bit to the left of the binary point. Two numbers that are very close in value are 
x = 1.00...00 x 2! and y = 1.11...11 2°. If the smaller number is to be sub- 
tracted from the larger, it must be shifted right 1 bit to align the exponents. This is 
shown in Figure 9.25a. In the process, y loses 1 bit of significance; the result is 2-*. 
The same operation is repeated in part (b) with the addition of guard bits. Now the 
least significant bit is not lost due to alignment, and the result is 2-*°, a difference of 
a factor of 2 from the previous answer. When the radix is 16, the loss of precision 
can be greater. As Figures 9.25c and d show, the difference can be a factor of 16. 


ROUNDING Another detail that affects the precision of the result is the rounding 
policy. The result of any operation on the significands is generally stored in a longer 
register. When the result is put back into the floating-point format, the extra bits 
must be disposed of. 

A number of techniques have been explored for performing rounding. In fact, 
the IEEE standard lists four alternative approaches: 


e Round to nearest: The result is rounded to the nearest representable number. 
e Round toward +: The result is rounded up toward plus infinity. 
e Round toward —©: The result is rounded down toward negative infinity. 
e Round toward 0: The result is rounded toward zero. 
Let us consider each of these policies in turn. Round to nearest is the default 


rounding mode listed in the standard and is defined as follows: The representable 
value nearest to the infinitely precise result shall be delivered. 


340 


CHAPTER 9 / COMPUTER ARITHMETIC 


If the extra bits, beyond the 23 bits that can be stored, are 10010, then the extra 
bits amount to more than one-half of the last representable bit position. In this 
case, the correct answer is to add binary 1 to the last representable bit, rounding 
up to the next representable number. Now consider that the extra bits are 01111. 
In this case, the extra bits amount to less than one-half of the last representable 
bit position. The correct answer is simply to drop the extra bits (truncate), which 
has the effect of rounding down to the next representable number. 


The standard also addresses the special case of extra bits of the form 10000... . 
Here the result is exactly halfway between the two possible representable values. 
One possible technique here would be to always truncate, as this would be the sim- 
plest operation. However, the difficulty with this simple approach is that it intro- 
duces a small but cumulative bias into a sequence of computations. What is required 
is an unbiased method of rounding. One possible approach would be to round up or 
down on the basis of a random number so that, on average, the result would be un- 
biased. The argument against this approach is that it does not produce predictable, 
deterministic results. The approach taken by the IEEE standard is to force the result 
to be even: If the result of a computation is exactly midway between two repre- 
sentable numbers, the value is rounded up if the last representable bit is currently 1 
and not rounded up if it is currently 0. 

The next two options, rounding to plus and minus infinity, are useful in imple- 
menting a technique known as interval arithmetic. Interval arithmetic provides an 
efficient method for monitoring and controlling errors in floating-point computa- 
tions by producing two values for each result. The two values correspond to the 
lower and upper endpoints of an interval that contains the true result. The width of 
the interval, which is the difference between the upper and lower endpoints, indi- 
cates the accuracy of the result. If the endpoints of an interval are not representable, 
then the interval endpoints are rounded down and up, respectively. Although the 
width of the interval may vary according to implementation, many algorithms have 
been designed to produce narrow intervals. If the range between the upper and 
lower bounds is sufficiently narrow, then a sufficiently accurate result has been ob- 
tained. If not, at least we know this and can perform additional analysis. 

The final technique specified in the standard is round toward zero. This is, in 
fact, simple truncation: The extra bits are ignored. This is certainly the simplest tech- 
nique. However, the result is that the magnitude of the truncated value is always less 
than or equal to the more precise original value, introducing a consistent bias toward 
zero in the operation. This is a serious bias because it affects every operation for 
which there are nonzero extra bits. 


IEEE Standard for Binary Floating-Point Arithmetic 


IEEE 754 goes beyond the simple definition of a format to lay down specific prac- 
tices and procedures so that floating-point arithmetic produces uniform, predictable 
results independent of the hardware platform. One aspect of this has already been 
discussed, namely rounding. This subsection looks at three other topics: infinity, 
NaNs, and denormalized numbers. 
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INFINITY Infinity arithmetic is treated as the limiting case of real arithmetic, with 
the infinity values given the following interpretation: 
—oo < (every finite number) < +00 


With the exception of the special cases discussed subsequently, any arithmetic 
operation involving infinity yields the obvious result. 


For example: 


5 + (+00) = +00 5 + (+00) = +0 
5 — (+00) = —00 (+00) + (+00) = +00 
5 + (—00) = —0o (—0o0) + (—co) = —00 
5 — (—co) = +00 (—0o0o) — (+00) = —00 
5 X (+00) = +00 (+00) — (—oo) = +00 


QUIET AND SIGNALING NaNS A NaN is a symbolic entity encoded in floating- 
point format, of which there are two types: signaling and quiet. A signaling NaN sig- 
nals an invalid operation exception whenever it appears as an operand. Signaling 
NaNs afford values for uninitialized variables and arithmetic-like enhancements 
that are not the subject of the standard. A quiet NaN propagates through almost 
every arithmetic operation without signaling an exception. Table 9.6 indicates oper- 
ations that will produce a quiet NaN. 

Note that both types of NaNs have the same general format (Table 9.4): an ex- 
ponent of all ones and a nonzero fraction. The actual bit pattern of the nonzero frac- 
tion is implementation dependent; the fraction values can be used to distinguish 
quiet NaNs from signaling NaNs and to specify particular exception conditions. 


Table 9.6 Operations that Produce a Quiet NaN 


Operation Quiet NaN Produced by 


Any Any operation on a signaling NaN 


Magnitude subtraction of infinities: 


(+00) + (00) 
Add or subtract (—00) + (+00) 
( 


+00) — (+00) 
ocean 
Multiply 0 xX oc 


eee 0 co 
Division Be ee 
0 [e-e) 


Remainder x REM 0 or co REM y 


Square root Vx, where x < 0 
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Q~ 124 


(a) 32-Bit format without denormalized numbers 


Uniform 
spacing 


7124 


(b) 32-Bit format with denormalized numbers 


Figure 9.26 The Effect of IEEE 754 Denormalized Numbers 


DENORMALIZED NUMBERS Denormalized numbers are included in IEEE 754 to 
handle cases of exponent underflow. When the exponent of the result becomes too 
small (a negative exponent with too large a magnitude), the result is denormalized 
by right shifting the fraction and incrementing the exponent for each shift until the 
exponent is within a representable range. 

Figure 9.26 illustrates the effect of including denormalized numbers. The repre- 
sentable numbers can be grouped into intervals of the form [2”, 2”*']. Within each such 
interval, the exponent portion of the number remains constant while the fraction 
varies, producing a uniform spacing of representable numbers within the interval. As 
we get closer to zero, each successive interval is half the width of the preceding interval 
but contains the same number of representable numbers. Hence the density of repre- 
sentable numbers increases as we approach zero. However, if only normalized numbers 
are used, there is a gap between the smallest normalized number and 0. In the case of 
the 32-bit IEEE 754 format, there are 2” representable numbers in each interval, and 
the smallest representable positive number is 2”'*°. With the addition of denormalized 
numbers, an additional 2* — 1 numbers are uniformly added between 0 and 27!”°. 

The use of denormalized numbers is referred to as gradual underflow 
[COONS81]. Without denormalized numbers, the gap between the smallest repre- 
sentable nonzero number and zero is much wider than the gap between the smallest 
representable nonzero number and the next larger number. Gradual underflow fills 
in that gap and reduces the impact of exponent underflow to a level comparable 
with roundoff among the normalized numbers. 


9.6 RECOMMENDED READING AND WEB SITES 


[ERCE04] and [PARH00] are excellent treatments of computer arithmetic, covering all of 
the topics in this chapter in detail. [FLYNO1] is a useful discussion that focuses on practical 
design and implementation issues. For the serious student of computer arithmetic, a very 
useful reference is the two-volume [SWAR90]. Volume I was originally published in 1980 
and provides key papers (some very difficult to obtain otherwise) on computer arithmetic 
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fundamentals. Volume II contains more recent papers, covering theoretical, design, and im- 
plementation aspects of computer arithmetic. 

For floating-point arithmetic, [GOLD91] is well named: “What Every Computer Scien- 
tist Should Know About Floating-Point Arithmetic.” Another excellent treatment of the topic 
is contained in [KNUT98], which also covers integer computer arithmetic. The following 
more in-depth treatments are also worthwhile: [OVER01, EVEN00a, OBER97a, OBER97b, 
SODE96]. [KUCK77] is a good discussion of rounding methods in floating-point arithmetic. 
[EVENO00b] examines rounding with respect to IEEE 754. 

[SCHW99] describes the first IBM S/390 processor to integrate radix-16 and IEEE 754 
floating-point arithmetic in the same floating-point unit. 


ERCE04 Ercegovac, M., and Lang, T. Digital Arithmetic. San Francisco: Morgan 
Kaufmann, 2004. 

EVEN00a_ Even, G., and Paul, W. “On the Design of IEEE Compliant Floating-Point 
Units.” IEEE Transactions on Computers, May 2000. 

EVEN00b_ Even, G., and Seidel, P. “A Comparison of Three Rounding Algorithms for 
IEEE Floating-Point Multiplication.” IEEE Transactions on Computers, July 2000. 

FLYNO1 Flynn, M., and Oberman, S. Advanced Computer Arithmetic Design. New York: 
Wiley, 2001. 

GOLD91_ Goldberg, D. “What Every Computer Scientist Should Know About Floating- 
Point Arithmetic.” ACM Computing Surveys, March 1991. 

KNUT98_ Knuth, D. The Art of Computer Programming, Volume 2: Seminumerical Algo- 
rithms. Reading, MA: Addison-Wesley, 1998. 

KUCK77 Kuck, D.; Parker, D.; and Sameh, A. “An Analysis of Rounding Methods in 
Floating-Point Arithmetic.” IEEE Transactions on Computers. July 1977. 

OBER97a_ Oberman, S., and Flynn, M. “Design Issues in Division and Other Floating- 
Point Operations.” [EEE Transactions on Computers, February 1997. 

OBER97b Oberman, S., and Flynn, M. “Division Algorithms and Implementations.” 
IEEE Transactions on Computers, August 1997. 

OVERO1 Overton, M. Numerical Computing with LEEE Floating Point Arithmetic. 
Philadelphia, PA: Society for Industrial and Applied Mathematics, 2001. 

PARHO00 Parhami, B. Computer Arithmetic: Algorithms and Hardware Design. Oxford: 
Oxford University Press, 2000. 

SCHW99_ Schwarz, E., and Krygowski, C. “The S/390 G5 Floating-Point Unit.” IBM 
Journal of Research and Development, September/November 1999. 

SODE96_ Soderquist, P., and Leeser, M. “Area and Performance Tradeoffs in Floating- 
Point Divide and Square-Root Implementations.” ACM Computing Surveys, 
September 1996. 

SWAR90_ Swartzlander, E., editor. Computer Arithmetic, Volumes I and IT. Los Alamitos, 
CA: IEEE Computer Society Press, 1990. 


Recommended Web site: 


e IEEE 754: The IEEE 754 documents, related publications and papers, and a useful set 
of links related to computer arithmetic 
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9.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


arithmetic and logic unit 
(ALU) 

arithmetic shift 

base 

biased representation 
denormalized number 
dividend 


divisor 


exponent 

exponent overflow 

exponent underflow 
fixed-point representation 
floating-point representation 


mantissa 

minuend 

multiplicand 

multiplier 

negative overflow 

negative underflow 

normalized number 

ones complement 
representation 

overflow 

partial product 

positive overflow 

positive underflow 


quotient 

radix point 

remainder 

rounding 

sign bit 

significand 

significand overflow 

significand underflow 

sign-magnitude 
representation 

subtrahend 

twos complement 
representation 


guard bits product 


Review Questions 


9.1 Briefly explain the following representations: sign magnitude, twos complement, 
biased. 


9.2 Explain how to determine if a number is negative in the following representations: 
sign magnitude, twos complement, biased. 


9.3 What is the sign-extension rule for twos complement numbers? 
9.4 How can you form the negation of an integer in twos complement representation? 


9.5 In general terms, when does the twos complement operation on an n-bit integer pro- 
duce the same integer? 


9.6 What is the difference between the twos complement representation of a number and 
the twos complement of a number? 


9.7 If we treat 2 twos complement numbers as unsigned integers for purposes of addition, 
the result is correct if interpreted as a twos complement number. This is not true for 
multiplication. Why? 


9.8 What are the four essential elements of a number in floating-point notation? 


9.9 What is the benefit of using biased representation for the exponent portion of a 
floating-point number? 


9.10 What are the differences among positive overflow, exponent overflow, and significand 
overflow? 


9.11 What are the basic elements of floating-point addition and subtraction? 
9.12 Give a reason for the use of guard bits. 
9.13 List four alternative methods of rounding the result of a floating-point operation. 


Problems 


9.1 Represent the following decimal numbers in both binary sign/magnitude and twos 
complement using 16 bits: +512; —29. 


9.2 Represent the following twos complement values in decimal: 1101011; 0101101. 


9.3 


9.4 
9.5 


9.6 


9.7 


9.8 


9.9 
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Another representation of binary integers that is sometimes encountered is ones 

complement. Positive integers are represented in the same way as sign magnitude. A 

negative integer is represented by taking the Boolean complement of each bit of the 

corresponding positive number. 

a. Provide a definition of ones complement numbers using a weighted sum of bits, 
similar to Equations (9.1) and (9.2). 

b. What is the range of numbers that can be represented in ones complement? 

c. Define an algorithm for performing addition in ones complement arithmetic. 

Note: Ones complement arithmetic disappeared from hardware in the 1960s, but still 

survives checksum calculations for the Internet Protocol (IP) and the Transmission 

Control Protocol (TCP). 

Add columns to Table 9.1 for sign magnitude and ones complement. 


Consider the following operation on a binary word. Start with the least significant bit. 

Copy all bits that are 0 until the first bit is reached and copy that bit, too. Then take 

the complement of each bit thereafter. What is the result? 

In Section 9.3, the twos complement operation is defined as follows. To find the twos 

complement of X, take the Boolean complement of each bit of X, and then add 1. 

a. Show that the following is an equivalent definition. For an n-bit integer X, the 
twos complement of X is formed by treating X as an unsigned integer and calcu- 
lating (2, — X). 

b. Demonstrate that Figure 9.5 can be used to support graphically the claim in part 
(a), by showing how a clockwise movement is used to achieve subtraction. 

The r’s complement of an n-digit number N in base r is defined as r” — N for N # 0 

and 0 for N = 0. Find the tens complement of the decimal number 13250. 

Calculate (72530 — 13250) using tens complement arithmetic. Assume rules similar 

to those for twos complement arithmetic. 


Consider the twos complement addition of two n-bit numbers: 
Zn n-2 +++ 20 = Xn-1¥%n-2+++ XO F Vn-1Yn-2+++ Yo 


Assume that bitwise addition is performed with a carry bit c; generated by the addi- 
tion of x;, y;, and c;_,. Let v be a binary variable indicating overflow when v = 1. Fill 
in the values in the table. 


Input Vn-1 0 0 1 1 0 0 1 1 


Output 


9.10 


9.11 


9.12 


Assume numbers are represented in 8-bit twos complement representation. Show the 
calculation of the following: 


a 6+ 13 b. —6 + 13 ec. 6 — 13 d. —6 — 13 

Find the following differences using twos complement arithmetic: 

a. 111000 b. 11001100 e. 111100001111 d. 11000011 
—110011 — 101110 —110011110011 —11101000 


Is the following a valid alternative definition of overflow in twos complement 
arithmetic? 
If the exclusive-OR of the carry bits into and out of the leftmost column is 1, then 
there is an overflow condition. Otherwise, there is not. 
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9.13 
9.14 


9.15 


9.16 


9.17 


9.18 


9.19 


9.20 


9.21 


9.22 


9.23 


9.24 


9.25 


9.26 


9.27 


Compare Figures 9.9 and 9.12. Why is the C bit not used in the latter? 

Given x = 0101 and y = 1010 in twos complement notation (i.e., x = 5, y = —6), 

compute the product p = x X y with Booth’s algorithm. 

Use the Booth algorithm to multiply 23 (multiplicand) by 29 (multiplier), where each 

number is represented using 6 bits. 

Prove that the multiplication of two n-digit numbers in base B gives a product of no 

more than 2n digits. 

Verify the validity of the unsigned binary division algorithm of Figure 9.16 by show- 

ing the steps involved in calculating the division depicted in Figure 9.15. Use a pre- 

sentation similar to that of Figure 9.17. 

The twos complement integer division algorithm described in Section 9.3 is known as 

the restoring method because the value in the A register must be restored following 

unsuccessful subtraction. A slightly more complex approach, known as nonrestoring, 
avoids the unnecessary subtraction and addition. Propose an algorithm for this latter 
approach. 

Under computer integer arithmetic, the quotient J/K of two integers J and K is less 

than or equal to the usual quotient. True or false? 

Divide —145 by 13 in binary twos complement notation, using 12-bit words. Use the 

algorithm described in Section 9.3. 

a. Consider a fixed-point representation using decimal digits, in which the implied 
radix point can be in any position (e.g., to the right of the least significant digit, to 
the right of the most significant digit, and so on). How many decimal digits are 
needed to represent the approximations of both Planck’s constant (6.63 x 107°’) 
and Avogadro’s number (6.02 x 107)? The implied radix point must be in the 
same position for both numbers. 

b. Now consider a decimal floating-point format with the exponent stored in a 
biased representation with a bias of 50. A normalized representation is assumed. 
How many decimal digits are needed to represent these constants in this floating- 
point format? 

Assume that the exponent e is constrained to lie in the range 0 = e = X, witha bias 

of q, that the base is b, and that the significand is p digits in length. 

a. What are the largest and smallest positive values that can be written? 

b. What are the largest and smallest positive values that can be written as normal- 
ized floating-point numbers? 

Express the following numbers in IEEE 32-bit floating-point format: 

a. —5 b. —6 ce. -1.5 d. 384 e. 1/16 f. —1/32 

The following numbers use the IEEE 32-bit floating-point format. What is the equiv- 

alent decimal value? 

a. 110000011 11000000000000000000000 

b. 001111110 10100000000000000000000 

c. 0 10000000 00000000000000000000000 

Consider a reduced 7-bit IEEE floating-point format, with 3 bits for the exponent and 

3 bits for the significand. List all 127 values. 

Express the following numbers in IBM’s 32-bit floating-point format, which uses a 7-bit 

exponent with an implied base of 16 and an exponent bias of 64 (40 hexadecimal). A 

normalized floating-point number requires that the leftmost hexadecimal digit be 

nonzero; the implied radix point is to the left of that digit. 


a. 1.0 c. 1/64 e. —15.0 g. 7.2 x 107% 
b. 0.5 d. 0.0 f. 5.4 x 10°” h. 65535 


Let 5BCA0000 be a floating-point number in IBM format, expressed in hexadecimal. 
What is the decimal value of the number? 


9.28 


9.32 


9.33 
9.34 


\o 
te 
on 


9.36 


9.37 


9.38 


9.39 


9.40 
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What would be the bias value for 

a. A base-2 exponent (B = 2) ina 6-bit field? 

b. A base-8 exponent (B = 8) ina 7-bit field? 

Draw a number line similar to that in Figure 9.19b for the floating-point format of 
Figure 9.21b. 

Consider a floating-point format with 8 bits for the biased exponent and 23 bits for 
the significand. Show the bit pattern for the following numbers in this format: 

a. —720 b. 0.645 

The text mentions that a 32-bit format can represent a maximum of 2” different num- 
bers. How many different numbers can be represented in the IEEE 32-bit format? 
Explain. 

Any floating-point representation used in a computer can represent only certain real 
numbers exactly; all others must be approximated. If A’ is the stored value approxi- 
mating the real value A, then the relative error, r, is expressed as 


_A-A’ 


; A 


Represent the decimal quantity +0.4 in the following floating-point format: base = 2; 
exponent: biased, 4 bits; significand, 7 bits. What is the relative error? 


If A = 1.427, find the relative error if A is truncated to 1.42 and if it is rounded to 1.43. 


When people speak about inaccuracy in floating-point arithmetic, they often ascribe 
errors to cancellation that occurs during the subtraction of nearly equal quantities. 
But when X and Y are approximately equal, the difference X¥ — Y is obtained exact- 
ly, with no error. What do these people really mean? 


Numerical values A and B are stored in the computer as approximations A’ and B’. 
Neglecting any further truncation or roundoff errors, show that the relative error of 
the product is approximately the sum of the relative errors in the factors. 


One of the most serious errors in computer calculations occurs when two nearly equal 
numbers are subtracted. Consider A = 0.22288 and B = 0.22211. The computer trun- 
cates all values to four decimal digits. Thus A’ = 0.2228 and B’ = 0.2221. 

a. What are the relative errors for A’ and B'? 

b. What is the relative error for C’ = A’ — B'? 


To get some feel for the effects of denormalization and gradual underflow, consider a 
decimal system that provides 6 decimal digits for the significand and for which the 
smallest normalized number is 10°”. A normalized number has one nonzero decimal 
digit to the left of the decimal point. Perform the following calculations and denor- 
malize the results. Comment on the results. 

a. (2.50000 x 10%) x (3.50000 x 10) 

b. (2.50000 x 10°) x (3.50000 x 10°°°) 

c. (5.67834 X 10°*’) — (5.67812 x 10°”) 

Show how the following floating-point additions are performed (where significands 
are truncated to 4 decimal digits). Show the results in normalized form. 

a. 5.566 X 10° + 7.777 x 10° b. 3.344 x 10' + 8.877 x 10° 

Show how the following floating-point subtractions are performed (where significands 
are truncated to 4 decimal digits). Show the results in normalized form. 

a. 7.744 x 107 — 6.666 x 10° b. 8.844 x 10% — 2.233 x 107 

Show how the following floating-point calculations are performed (where significands 


are truncated to 4 decimal digits). Show the results in normalized form. 
a. (2.255 X 10') X (1.234 x 10°) b. (8.833 x 10°) + (5.555 X 104) 
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KEY POINTS 


® The essential elements of a computer instruction are the opcode, which 
specifies the operation to be performed; the source and destination 
operand references, which specify the input and output locations for the op- 
eration; and a next instruction reference, which is usually implicit. 


® Opcodes specify operations in one of the following general categories: 
arithmetic and logic operations; movement of data between two registers, 
register and memory, or two memory locations; I/O; and control. 


® Operand references specify a register or memory location of operand data. 
The type of data may be addresses, numbers, characters, or logical data. 


® A common architectural feature in processors is the use of a stack, which 
may or may not be visible to the programmer. Stacks are used to manage 
procedure calls and returns and may be provided as an alternative form of 
addressing memory. The basic stack operations are PUSH, POP, and opera- 
tions on the top one or two stack locations. Stacks typically are implemented 
to grow from higher addresses to lower addresses. 


® Byte-addressable processors may be categorized as big endian, little endi- 
an, or bi-endian. A multibyte numerical value stored with the most signifi- 
cant byte in the lowest numerical address is stored in big-endian fashion. 
The little-endian style stores the most significant byte in the highest numeri- 
cal address. A bi-endian processor can handle both styles. 


Much of what is discussed in this book is not readily apparent to the user or program- 
mer of a computer. If a programmer is using a high-level language, such as Pascal or 
Ada, very little of the architecture of the underlying machine is visible. 

One boundary where the computer designer and the computer programmer can 
view the same machine is the machine instruction set. From the designer’s point of view, 
the machine instruction set provides the functional requirements for the processor: im- 
plementing the processor is a task that in large part involves implementing the machine 
instruction set. The user who chooses to program in machine language (actually, in as- 
sembly language; see Appendix B) becomes aware of the register and memory structure, 
the types of data directly supported by the machine, and the functioning of the ALU. 

A description of a computer’s machine instruction set goes a long way toward ex- 
plaining the computer’s processor. Accordingly, we focus on machine instructions in 
this chapter and the next. 


10.1 MACHINE INSTRUCTION CHARACTERISTICS 


The operation of the processor is determined by the instructions it executes, referred 
to as machine instructions or computer instructions. The collection of different instruc- 
tions that the processor can execute is referred to as the processotr’s instruction set. 
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Instruction Operand Operand 
fetch fetch store 
Multiple Multiple 
operands results 
Instruction, Instruction Operand Dat. Operand 
address operation ——» address * address 
calculation decoding calculation eperaticy calculation 


Return for string 
or vector data 


Instruction complete, 
fetch next instruction 


Figure 10.1 Instruction Cycle State Diagram 


Elements of a Machine Instruction 


Each instruction must contain the information required by the processor for execu- 
tion. Figure 10.1, which repeats Figure 3.6, shows the steps involved in instruction 
execution and, by implication, defines the elements of a machine instruction. These 
elements are as follows: 


e Operation code: Specifies the operation to be performed (e.g., ADD, I/O). 
The operation is specified by a binary code, known as the operation code, 
or opcode. 


e Source operand reference: The operation may involve one or more source 
operands, that is, operands that are inputs for the operation. 


e Result operand reference: The operation may produce a result. 


e Next instruction reference: This tells the processor where to fetch the next in- 
struction after the execution of this instruction is complete. 


The address of the next instruction to be fetched could be either a real address 
or a virtual address, depending on the architecture. Generally, the distinction is 
transparent to the instruction set architecture. In most cases, the next instruction to 
be fetched immediately follows the current instruction. In those cases, there is no ex- 
plicit reference to the next instruction. When an explicit reference is needed, then 
the main memory or virtual memory address must be supplied. The form in which 
that address is supplied is discussed in Chapter 11. 

Source and result operands can be in one of four areas: 


e Main or virtual memory: As with next instruction references, the main or vir- 
tual memory address must be supplied. 


e Processor register: With rare exceptions, a processor contains one or more regis- 
ters that may be referenced by machine instructions. If only one register exists, 
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4 Bits 6 Bits 6 Bits 
Opcode Operand reference Operand reference 
x 16 Bits > 


Figure 10.2 A Simple Instruction Format 


reference to it may be implicit. If more than one register exists, then each register 
is assigned a unique name or number, and the instruction must contain the num- 
ber of the desired register. 


e Immediate: The value of the operand is contained in a field in the instruction 
being executed. 


e I/O device: The instruction must specify the I/O module and device for the op- 
eration. If memory-mapped I/O is used, this is just another main or virtual 
memory address. 


Instruction Representation 


Within the computer, each instruction is represented by a sequence of bits. The in- 
struction is divided into fields, corresponding to the constituent elements of the in- 
struction. A simple example of an instruction format is shown in Figure 10.2. As 
another example, the IAS instruction format is shown in Figure 2.2. With most 
instruction sets, more than one format is used. During instruction execution, an in- 
struction is read into an instruction register (IR) in the processor. The processor 
must be able to extract the data from the various instruction fields to perform the 
required operation. 

It is difficult for both the programmer and the reader of textbooks to deal with 
binary representations of machine instructions. Thus, it has become common prac- 
tice to use a symbolic representation of machine instructions. An example of this was 
used for the IAS instruction set, in Table 2.1. 

Opcodes are represented by abbreviations, called mnemonics, that indicate the 
operation. Common examples include 


ADD Add 

SUB Subtract 

MUL Multiply 

DIV Divide 

LOAD Load data from memory 
STOR Store data to memory 


Operands are also represented symbolically. For example, the instruction 
ADDR, Y 


may mean add the value contained in data location Y to the contents of register R. 
In this example, Y refers to the address of a location in memory, and R refers to a 
particular register. Note that the operation is performed on the contents of a loca- 
tion, not on its address. 
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Thus, it is possible to write a machine-language program in symbolic form. 
Each symbolic opcode has a fixed binary representation, and the programmer spec- 
ifies the location of each symbolic operand. For example, the programmer might 
begin with a list of definitions: 


X = 513 
Y = 514 


and so on.A simple program would accept this symbolic input, convert opcodes and 
operand references to binary form, and construct binary machine instructions. 

Machine-language programmers are rare to the point of nonexistence. Most 
programs today are written in a high-level language or, failing that, assembly lan- 
guage, which is discussed in Appencix B. However, symbolic machine language 
remains a useful tool for describing machine instructions, and we will use it for 
that purpose. 


Instruction Types 


Consider a high-level language instruction that could be expressed in a language 
such as BASIC or FORTRAN. For example, 


xX=X+yY 


This statement instructs the computer to add the value stored in Y to the value 
stored in X and put the result in X. How might this be accomplished with machine 
instructions? Let us assume that the variables X and Y correspond to locations 513 
and 514. If we assume a simple set of machine instructions, this operation could be 
accomplished with three instructions: 


1. Load a register with the contents of memory location 513. 
2. Add the contents of memory location 514 to the register. 


3. Store the contents of the register in memory location 513. 


As can be seen, the single BASIC instruction may require three machine in- 
structions. This is typical of the relationship between a high-level language and a 
machine language. A high-level language expresses operations in a concise algebraic 
form, using variables. A machine language expresses operations in a basic form in- 
volving the movement of data to or from registers. 

With this simple example to guide us, let us consider the types of instructions 
that must be included in a practical computer. A computer should have a set of in- 
structions that allows the user to formulate any data processing task. Another way 
to view it is to consider the capabilities of a high-level programming language. Any 
program written in a high-level language must be translated into machine language 
to be executed. Thus, the set of machine instructions must be sufficient to express 
any of the instructions from a high-level language. With this in mind we can catego- 
rize instruction types as follows: 


e Data processing: Arithmetic and logic instructions 
e Data storage: Movement of data into or out of register and or memory locations 
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e Data movement: I/O instructions 


e Control: Test and branch instructions 


Arithmetic instructions provide computational capabilities for processing nu- 
meric data. Logic (Boolean) instructions operate on the bits of a word as bits rather 
than as numbers; thus, they provide capabilities for processing any other type of 
data the user may wish to employ. These operations are performed primarily on data 
in processor registers. Therefore, there must be memory instructions for moving 
data between memory and the registers. //O instructions are needed to transfer pro- 
grams and data into memory and the results of computations back out to the user. 
Test instructions are used to test the value of a data word or the status of a compu- 
tation. Branch instructions are then used to branch to a different set of instructions 
depending on the decision made. 

We will examine the various types of instructions in greater detail later in this 
chapter. 


Number of Addresses 


One of the traditional ways of describing processor architecture is in terms of the 
number of addresses contained in each instruction. This dimension has become less 
significant with the increasing complexity of processor design. Nevertheless, it is 
useful at this point to draw and analyze this distinction. 

What is the maximum number of addresses one might need in an instruction? 
Evidently, arithmetic and logic instructions will require the most operands. Virtu- 
ally all arithmetic and logic operations are either unary (one source operand) 
or binary (two source operands). Thus, we would need a maximum of two ad- 
dresses to reference source operands. The result of an operation must be stored, 
suggesting a third address, which defines a destination operand. Finally, after 
completion of an instruction, the next instruction must be fetched, and its address 
is needed. 

This line of reasoning suggests that an instruction could plausibly be required 
to contain four address references: two source operands, one destination operand, 
and the address of the next instruction. In most architectures, most instructions have 
one, two, or three operand addresses, with the address of the next instruction being 
implicit (obtained from the program counter). Most architectures also have a few 
special-purpose instructions with more operands. For example, the load and store 
multiple instructions of the ARM architecture, described in Chapter 11, designate 
up to 17 register operands in a single instruction. 

Figure 10.3 compares typical one-, two-, and three-address instructions that 
could be used to compute Y = (A — B)/[C + (D X E)]. With three addresses, 
each instruction specifies two source operand locations and a destination operand 
location. Because we choose not to alter the value of any of the operand locations, a 
temporary location, T, is used to store some intermediate results. Note that there are 
four instructions and that the original expression had five operands. 

Three-address instruction formats are not common because they require a rel- 
atively long instruction format to hold the three address references. With two- 
address instructions, and for binary operations, one address must do double duty as 
both an operand and a result. Thus, the instruction SUB Y, B carries out the calculation 
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Instruction Comment 

SUB Y,A,B YceA-B 

MPY T,D,E TCDXE 

ADD T,T,C TeT+C 

DIN Xe yet YeY+T 

(a) Three-address instructions nceencaen Coe 

LOAD D ACeED 

Instruction Comment MPY E AC<AC XE 
ADD C AC AC+C 

MOVE Y, A YoA eaey ae 

Bue eee cat LOAD A ACEA 

eae ites SUB B AC©AC-B 

suing ae DIV Y ACeAC+Y 

ADD T,C TeaTeE eee. ae 

DIV. Y,T Yo Y+T 


(c) One-address instructions 
A-—B 
C+(DXE) 


(b) Two-address instructions 


Figure 10.3 Programs to Execute Y = 


Y — Band stores the result in Y. The two-address format reduces the space require- 
ment but also introduces some awkwardness. To avoid altering the value of an 
operand, a MOVE instruction is used to move one of the values to a result or tem- 
porary location before performing the operation. Our sample program expands to 
six instructions. 

Simpler yet is the one-address instruction. For this to work, a second address 
must be implicit. This was common in earlier machines, with the implied address 
being a processor register known as the accumulator (AC). The accumulator con- 
tains one of the operands and is used to store the result. In our example, eight in- 
structions are needed to accomplish the task. 

It is, in fact, possible to make do with zero addresses for some instructions. 
Zero-address instructions are applicable to a special memory organization, called 
a stack. A stack is a last-in-first-out set of locations. The stack is in a known loca- 
tion and, often, at least the top two elements are in processor registers. Thus, zero- 
address instructions would reference the top two stack elements. Stacks are 
described in Appendix 10A. Their use is explored further later in this chapter and 
in Chapter 11. 

Table 10.1 summarizes the interpretations to be placed on instructions with 
zero, one, two, or three addresses. In each case in the table, it is assumed that the ad- 
dress of the next instruction is implicit, and that one operation with two source 
operands and one result operand is to be performed. 

The number of addresses per instruction is a basic design decision. Fewer ad- 
dresses per instruction result in instructions that are more primitive, requiring a less 
complex processor. It also results in instructions of shorter length. On the other hand, 
programs contain more total instructions, which in general results in longer execution 
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Table 10.1 Utilization of Instruction Addresses (Nonbranching Instructions) 


Number of Addresses Symbolic Representation Interpretation 


A<—BOPC 
A<—AOPB 
AC<—ACOPA 
T<(T -1)OPT 


AC = accumulator 
T = top of stack 
(T — 1) = second element of stack 


A, B, C = memory or register locations 


times and longer, more complex programs. Also, there is an important threshold be- 
tween one-address and multiple-address instructions. With one-address instructions, 
the programmer generally has available only one general-purpose register, the accu- 
mulator. With multiple-address instructions, it is common to have multiple general- 
purpose registers. This allows some operations to be performed solely on registers. 
Because register references are faster than memory references, this speeds up execu- 
tion. For reasons of flexibility and ability to use multiple registers, most contempo- 
rary machines employ a mixture of two- and three-address instructions. 

The design trade-offs involved in choosing the number of addresses per in- 
struction are complicated by other factors. There is the issue of whether an address 
references a memory location or a register. Because there are fewer registers, fewer 
bits are needed for a register reference. Also, as we shall see in the next chapter, a 
machine may offer a variety of addressing modes, and the specification of mode 
takes one or more bits. The result is that most processor designs involve a variety of 
instruction formats. 


Instruction Set Design 


One of the most interesting, and most analyzed, aspects of computer design is instruc- 
tion set design. The design of an instruction set is very complex because it affects so 
many aspects of the computer system. The instruction set defines many of the functions 
performed by the processor and thus has a significant effect on the implementation of 
the processor. The instruction set is the programmer’s means of controlling the proces- 
sor. Thus, programmer requirements must be considered in designing the instruction set. 

It may surprise you to know that some of the most fundamental issues relating 
to the design of instruction sets remain in dispute. Indeed, in recent years, the level 
of disagreement concerning these fundamentals has actually grown. The most im- 
portant of these fundamental design issues include the following: 


¢ Operation repertoire: How many and which operations to provide, and how 
complex operations should be 
e Data types: The various types of data upon which operations are performed 


e Instruction format: Instruction length (in bits), number of addresses, size of 
various fields, and so on 
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e Registers: Number of processor registers that can be referenced by instructions, 
and their use 


e Addressing: The mode or modes by which the address of an operand is specified 


These issues are highly interrelated and must be considered together in de- 
signing an instruction set. This book, of course, must consider them in some se- 
quence, but an attempt is made to show the interrelationships. 

Because of the importance of this topic, much of Part Three is devoted to in- 
struction set design. Following this overview section, this chapter examines data types 
and operation repertoire. Chapter 11 examines addressing modes (which includes a 
consideration of registers) and instruction formats. Chapter 13 examines the reduced 
instruction set computer (RISC). RISC architecture calls into question many of the 
instruction set design decisions traditionally made in commercial computers. 


10.2 TYPES OF OPERANDS 


Machine instructions operate on data. The most important general categories of 
data are 


e Addresses 
e Numbers 

e Characters 
e Logical data 


We shall see, in discussing addressing modes in Chapter 11, that addresses are, 
in fact, a form of data. In many cases, some calculation must be performed on the 
operand reference in an instruction to determine the main or virtual memory ad- 
dress. In this context, addresses can be considered to be unsigned integers. 

Other common data types are numbers, characters, and logical data, and each 
of these is briefly examined in this section. Beyond that, some machines define spe- 
cialized data types or data structures. For example, there may be machine opera- 
tions that operate directly on a list or a string of characters. 


Numbers 


All machine languages include numeric data types. Even in nonnumeric data pro- 
cessing, there is a need for numbers to act as counters, field widths, and so forth. An 
important distinction between numbers used in ordinary mathematics and numbers 
stored in a computer is that the latter are limited. This is true in two senses. First, 
there is a limit to the magnitude of numbers representable on a machine and second, 
in the case of floating-point numbers, a limit to their precision. Thus, the programmer 
is faced with understanding the consequences of rounding, overflow, and underflow. 
Three types of numerical data are common in computers: 


e Binary integer or binary fixed point 
e Binary floating point 
e Decimal 
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We examined the first two in some detail in Chapter 9. It remains to say a few words 
about decimal numbers. 

Although all internal computer operations are binary in nature, the human 
users of the system deal with decimal numbers. Thus, there is a necessity to convert 
from decimal to binary on input and from binary to decimal on output. For applica- 
tions in which there is a great deal of I/O and comparatively little, comparatively 
simple computation, it is preferable to store and operate on the numbers in decimal 
form. The most common representation for this purpose is packed decimal.' 

With packed decimal, each decimal digit is represented by a 4-bit code, in the ob- 
vious way, with two digits stored per byte. Thus, 0 = 0000, 1 = 0001,...,8 = 1000, 
and 9 = 1001. Note that this is a rather inefficient code because only 10 of 16 possi- 
ble 4-bit values are used. To form numbers, 4-bit codes are strung together, usually 
in multiples of 8 bits. Thus, the code for 246 is 0000 0010 0100 0110. This code is 
clearly less compact than a straight binary representation, but it avoids the conver- 
sion overhead. Negative numbers can be represented by including a 4-bit sign digit 
at either the left or right end of a string of packed decimal digits. Standard sign val- 
ues are 1100 for positive (+) and 1101 for negative (—). 

Many machines provide arithmetic instructions for performing operations di- 
rectly on packed decimal numbers. The algorithms are quite similar to those de- 
scribed in Section 9.3 but must take into account the decimal carry operation. 


Characters 


A common form of data is text or character strings. While textual data are most con- 
venient for human beings, they cannot, in character form, be easily stored or trans- 
mitted by data processing and communications systems. Such systems are designed 
for binary data. Thus, a number of codes have been devised by which characters are 
represented by a sequence of bits. Perhaps the earliest common example of this is 
the Morse code. Today, the most commonly used character code in the International 
Reference Alphabet (IRA), referred to in the United States as the American Stan- 
dard Code for Information Interchange (ASCII; see Appendix F). Each character in 
this code is represented by a unique 7-bit pattern; thus, 128 different characters can 
be represented. This is a larger number than is necessary to represent printable 
characters, and some of the patterns represent control characters. Some of these 
control characters have to do with controlling the printing of characters on a page. 
Others are concerned with communications procedures. IRA-encoded characters 
are almost always stored and transmitted using 8 bits per character. The eighth bit 
may be set to 0 or used as a parity bit for error detection. In the latter case, the bit is 
set such that the total number of binary 1s in each octet is always odd (odd parity) 
or always even (even parity). 

Note in Table F1 (Appendix F) that for the IRA bit pattern 011XXXX, the 
digits 0 through 9 are represented by their binary equivalents, 0000 through 1001, in 


'Textbooks often refer to this as binary coded decimal (BCD). Strictly speaking, BCD refers to the 
encoding of each decimal digit by a unique 4-bit sequence. Packed decimal refers to the storage of BCD- 
encoded digits using one byte for each two digits. 
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the rightmost 4 bits. This is the same code as packed decimal. This facilitates conver- 
sion between 7-bit IRA and 4-bit packed decimal representation. 

Another code used to encode characters is the Extended Binary Coded 
Decimal Interchange Code (EBCDIC). EBCDIC is used on IBM mainframes. It 
is an 8-bit code. As with IRA, EBCDIC is compatible with packed decimal. In the 
case of EBCDIC, the codes 11110000 through 11111001 represent the digits 0 
through 9. 


Logical Data 


Normally, each word or other addressable unit (byte, halfword, and so on) is treated 
as a single unit of data. It is sometimes useful, however, to consider an n-bit unit as 
consisting of n 1-bit items of data, each item having the value 0 or 1. When data are 
viewed this way, they are considered to be logical data. 

There are two advantages to the bit-oriented view. First, we may sometimes 
wish to store an array of Boolean or binary data items, in which each item can take 
on only the values 1 (true) and 0 (false). With logical data, memory can be used most 
efficiently for this storage. Second, there are occasions when we wish to manipulate 
the bits of a data item. For example, if floating-point operations are implemented in 
software, we need to be able to shift significant bits in some operations. Another ex- 
ample: To convert from IRA to packed decimal, we need to extract the rightmost 
4 bits of each byte. 

Note that, in the preceding examples, the same data are treated sometimes as 
logical and other times as numerical or text. The “type” of a unit of data is deter- 
mined by the operation being performed on it. While this is not normally the case in 
high-level languages, it is almost always the case with machine language. 
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x86 Data Types 


The x86 can deal with data types of 8 (byte), 16 (word), 32 (doubleword), 64 (quad- 
word), and 128 (double quadword) bits in length. To allow maximum flexibility in 
data structures and efficient memory utilization, words need not be aligned at even- 
numbered addresses; doublewords need not be aligned at addresses evenly divisible 
by 4; and quadwords need not be aligned at addresses evenly divisible by 8, and so 
on. However, when data are accessed across a 32-bit bus, data transfers take place in 
units of doublewords, beginning at addresses divisible by 4. The processor converts 
the request for misaligned values into a sequence of requests for the bus transfer. As 
with all of the Intel 80x86 machines, the x86 uses the little-endian style; that is, the 
least significant byte is stored in the lowest address (see Appendix 10B for a discus- 
sion of endianness). 

The byte, word, doubleword, quadword, and double quadword are referred to 
as general data types. In addition, the x86 supports an impressive array of specific 
data types that are recognized and operated on by particular instructions. Table 10.2 
summarizes these types. 


Table 10.2 x86 Data Types 


Data Type 
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Description 


General 
Integer 


Ordinal 


Unpacked binary coded 
decimal (BCD) 


Packed BCD 


Near pointer 


Byte, word (16 bits), doubleword (32 bits), quadword (64 bits), and 
double quadword (128 bits) locations with arbitrary binary contents. 


A signed binary value contained in a byte, word, or doubleword, using 
twos complement representation. 


An unsigned integer contained in a byte, word, or doubleword. 


A representation of a BCD digit in the range 0 through 9, with one 
digit in each byte. 


Packed byte representation of two BCD digits; value in the range 0 to 99. 
A 16-bit, 32-bit, or 64-bit effective address that represents the offset 


within a segment. Used for all pointers in a nonsegmented memory and 
for references within a segment in a segmented memory. 


Far pointer A logical address consisting of a 16-bit segment selector and an offset 
of 16, 32, or 64 bits. Far pointers are used for memory references in a 
segmented memory model where the identity of a segment being 


accessed must be specified explicitly. 


Bit field A contiguous sequence of bits in which the position of each bit is 
considered as an independent unit. A bit string can begin at any bit 


position of any byte and can contain up to 32 bits. 


A contiguous sequence of bits, containing from zero to 2” — 1 bits. 


Bit string 


Byte string A contiguous sequence of bytes, words, or doublewords, containing from 


zero to 2* — 1 bytes. 
See Figure 10.4. 
Packed 64-bit and 128-bit data types 


Floating point 


Packed SIMD (single 
instruction, multiple data) 


Figure 10.4 illustrates the x86 numerical data types. The signed integers are in 
twos complement representation and may be 16, 32, or 64 bits long. The floating- 
point type actually refers to a set of types that are used by the floating-point unit 
and operated on by floating-point instructions. The three floating-point representa- 
tions conform to the IEEE 754 standard. 

The packed SIMD (single-instruction-multiple-data) data types were intro- 
duced to the x86 architecture as part of the extensions of the instruction set to opti- 
mize performance of multimedia applications. These extensions include MMX 
(multimedia extensions) and SSE (streaming SIMD extensions). The basic concept 
is that multiple operands are packed into a single referenced memory item and that 
these multiple operands are operated on in parallel. The data types are as follows: 


e Packed byte and packed byte integer: Bytes packed into a 64-bit quadword or 
128-bit double quadword, interpreted as a bit field or as an integer 


e Packed word and packed word integer: 16-bit words packed into a 64-bit quad- 
word or 128-bit double quadword, interpreted as a bit field or as an integer 


e Packed doubleword and packed doubleword integer: 32-bit doublewords packed 
into a 64-bit quadword or 128-bit double quadword, interpreted as a bit field or 
as an integer 
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Byte unsigned integer 


7 0 
Word unsigned integer 
15 0 
Doubleword unsigned integer 
31 0 
Quadword unsigned integer 
63 0 
Twos comp Byte signed integer 
7 O 


Twos comp| Word unsigned integer 


15 0 
Twos complement Doubleword unsigned integer 
31 0 
Twos complement Quadward unsigned integer 
63 0 
Sign bit Eee of 
ae ingle precision 
Exp Significand floating-point 
' . 31 0 
Sign bit en ve 
ey ouble precision 
Eee Siemens Floating point 
; . 6351 0 
Sign bit Integer bit 
me Double extended precision 
Exponent Significand : : 
floating point 
79 63 0 


Figure 10.4 x86 Numeric Data Formats 


e Packed quadword and packed qaudword integer: Two 64-bit quadwords packed 
into a 128-bit double quadword, interpreted as a bit field or as an integer 


e Packed single-precision floating-point and packed double-precision floating- 
point: Four 32-bit floating-point or two 64-bit floating-point values packed 
into a 128-bit double quadword 


ARM Data Types 


ARM processors support data types of 8 (byte), 16 (halfword), and 32 (word) bits in 
length. Normally, halfword access should be halfword aligned and word accesses 
should be word aligned. For nonaligned access attempts, the architecture supports 
three alternatives. 


e Default case: 


— The address is treated as truncated, with address bits[1:0] treated as zero for 
word accesses, and address bit[0] treated as zero for halfword accesses. 
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— Load single word ARM instructions are architecturally defined to rotate 
right the word-aligned data transferred by a non word-aligned address one, 
two, or three bytes depending on the value of the two least significant 
address bits. 


e Alignment checking: When the appropriate control bit is set, a data abort signal 
indicates an alignment fault for attempting unaligned access. 


e Unaligned access: When this option is enabled, the processor uses one or more 
memory accesses to generate the required transfer of adjacent bytes transpar- 
ently to the programmer 


For all three data types (byte, halfword, and word) an unsigned interpretation 
is supported, in which the value represents an unsigned, nonnegative integer. All 
three data types can also be used for twos complement signed integers. 

The majority of ARM processor implementations do not provide floating- 
point hardware, which saves power and area. If floating-point arithmetic is required 
in such processors, it must be implemented in software. ARM does support an op- 
tional floating-point coprocessor that supports the single- and double-precision 
floating point data types defined in IEEE 754. 


ENDIAN SUPPORT A state bit (E-bit) in the system control register is set and 
cleared under program control using the SETEND instruction. The E-bit defines 
which endian to load and store data. Figure 10.5 illustrates the functionality asso- 
ciated with the E-bit for a word load or store operation. This mechanism enables 
efficient dynamic data load/store for system designers who know they need to ac- 
cess data structures in the opposite endianness to their OS/environment. Note 
that the address of each data byte is fixed in memory. However, the byte lane in a 
register is different. 


Data bytes 
in memory 
(ascending address values 
from byte 0 to byte 3) 


Byte 3 


Byte 2 


Byte 1 


Byte 0 


31 0 31 0 
Byte3 | Byte2 | Bytel | ByteO Byte 0 | Bytel | Byte2 | Byte3 


ARM register ARM register 


Program status register E-bit = 0 Program status register E-bit = 1 


Figure 10.5 ARM Endian Support— Word Load/Store with E-Bit 
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10.4 TYPES OF OPERATIONS 


The number of different opcodes varies widely from machine to machine. However, 
the same general types of operations are found on all machines. A useful and typical 
categorization is the following: 


e Data transfer 

e Arithmetic 

e Logical 

e Conversion 

e lO 

e System control 

e Transfer of control 


Table 10.3 (based on [HAYE98]]) lists common instruction types in each cate- 
gory. This section provides a brief survey of these various types of operations, to- 
gether with a brief discussion of the actions taken by the processor to execute a 
particular type of operation (summarized in Table 10.4). The latter topic is exam- 
ined in more detail in Chapter 12. 


Table 10.3 Common Instruction Set Operations 


Operation Name Description 


Move (transfer) Transfer word or block from source to 
destination 


Store Transfer word from processor to memory 


Load (fetch) Transfer word from memory to processor 


Data Transfer Exchange Swap contents of source and destination 


Clear (reset) Transfer word of Os to destination 


Set Transfer word of 1s to destination 


Push Transfer word from source to top of stack 


Pop Transfer word from top of stack to 
destination 


Add Compute sum of two operands 


Subtract Compute difference of two operands 


Multiply Compute product of two operands 


Divide Compute quotient of two operands 


Arithmetic = 
Absolute Replace operand by its absolute value 


Negate Change sign of operand 


Increment Add 1 to operand 


Decrement Subtract 1 from operand 


(continued ) 


Continued 


Operation Name 
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Description 


Logical 


AND 


Perform logical AND 


OR 


Perform logical OR 


NOT (complement) 


Perform logical NOT 


Exclusive-OR 


Perform logical XOR 


Test 


Test specified condition; set flag(s) based on 
outcome 


Compare 


Make logical or arithmetic comparison of two or 
more operands; set flag(s) based on outcome 


Set Control Variables 


Class of instructions to set controls for protection 
purposes, interrupt handling, timer control, etc. 


Shift 


Left (right) shift operand, introducing constants 
at end 


Rotate 


Left (right) shift operand, with wraparound end 


Transfer of Control 


Jump (branch) 


Unconditional transfer; load PC with specified 
address 


Jump Conditional 


Test specified condition; either load PC with 
specified address or do nothing, based on 
condition 


Jump to Subroutine 


Place current program control information in 
known location; jump to specified address 


Return 


Replace contents of PC and other register from 
known location 


Execute 


Fetch operand from specified location and 
execute as instruction; do not modify PC 


Skip 


Increment PC to skip next instruction 


Skip Conditional 


Test specified condition; either skip or do nothing 
based on condition 


Halt 


Stop program execution 


Wait (hold) 


Stop program execution; test specified condition 
repeatedly; resume execution when condition is 
satisfied 


No operation 


No operation is performed, but program 
execution is continued 


Input/Output 


Input (read) 


Transfer data from specified I/O port or device 
to destination (e.g., main memory or processor 
register) 


Output (write) 


Start I/O 


Transfer data from specified source to I/O port 
or device 


Transfer instructions to I/O processor to initiate 
I/O operation 


Test I/O 


Transfer status information from I/O system 
to specified destination 


Conversion 


Translate 


Translate values in a section of memory based on 
a table of correspondences 


Convert 


Convert the contents of a word from one form 
to another (e.g., packed decimal to binary) 
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Table 10.4 Processor Actions for Various Types of Operations 


Transfer data from one location to another 


If memory is involved: 


Data Transfer Determine memory address 

Perform virtual-to-actual-memory address transformation 
Check cache 

Initiate memory read/write 


May involve data transfer, before and/or after 


Arithmetic Perform function in ALU 


Set condition codes and flags 


Logical Same as arithmetic 


Similar to arithmetic and logical. May involve special logic to perform 


Conversion conversion 


Update program counter. For subroutine call/return, manage parameter 
passing and linkage 


Transfer of Control 


Issue command to I/O module 
VO 


If memory-mapped I/O, determine memory-mapped address 


Data Transfer 


The most fundamental type of machine instruction is the data transfer instruc- 
tion. The data transfer instruction must specify several things. First, the location 
of the source and destination operands must be specified. Each location could be 
memory, a register, or the top of the stack. Second, the length of data to be trans- 
ferred must be indicated. Third, as with all instructions with operands, the mode 
of addressing for each operand must be specified. This latter point is discussed in 
Chapter 11. 

The choice of data transfer instructions to include in an instruction set exem- 
plifies the kinds of trade-offs the designer must make. For example, the general lo- 
cation (memory or register) of an operand can be indicated in either the specification 
of the opcode or the operand. Table 10.5 shows examples of the most common IBM 
EAS/390 data transfer instructions. Note that there are variants to indicate the 
amount of data to be transferred (8, 16, 32, or 64 bits). Also, there are different in- 
structions for register to register, register to memory, memory to register, and mem- 
ory to memory transfers. In contrast, the VAX has a move (MOV) instruction with 
variants for different amounts of data to be moved, but it specifies whether an 
operand is register or memory as part of the operand. The VAX approach is some- 
what easier for the programmer, who has fewer mnemonics to deal with. However, 
it is also somewhat less compact than the IBM EAS/390 approach because the loca- 
tion (register versus memory) of each operand must be specified separately in the 
instruction. We will return to this distinction when we discuss instruction formats, in 
the next chapter. 

In terms of processor action, data transfer operations are perhaps the simplest 
type. If both source and destination are registers, then the processor simply causes 
data to be transferred from one register to another; this is an operation internal to 
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Table 10.5 Examples of IBM EAS/390 Data Transfer Operations 


Operation Number of Bits 
Mnemonic Transferred Description 


32 Transfer from memory to register 


Load Halfword Transfer from memory to register 


Load Transfer from register to register 


Load (Short) Transfer from floating-point register to 
floating-point register 


Load (Short) Transfer from memory to floating-point 
register 


Load (Long) Transfer from floating-point register to 
floating-point register 


Load (Long) Transfer from memory to floating-point 
register 


Store Transfer from register to memory 


Store Halfword Transfer from register to memory 


Store Character Transfer from register to memory 


Store (Short) Transfer from floating-point register to 
memory 


Store (Long) Transfer from floating-point register to 
memory 


the processor. If one or both operands are in memory, then the processor must per- 
form some or all of the following actions: 


1. Calculate the memory address, based on the address mode (discussed in 
Chapter 11). 


2. If the address refers to virtual memory, translate from virtual to real memory 
address. 


3. Determine whether the addressed item is in cache. 
4. If not, issue a command to the memory module. 


Arithmetic 


Most machines provide the basic arithmetic operations of add, subtract, multiply, 
and divide. These are invariably provided for signed integer (fixed-point) numbers. 
Often they are also provided for floating-point and packed decimal numbers. 
Other possible operations include a variety of single-operand instructions; for 

example, 

e Absolute: Take the absolute value of the operand. 

e Negate: Negate the operand. 

e Increment: Add 1 to the operand. 


e Decrement: Subtract 1 from the operand. 
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Table 10.6 Basic Logical Operations 


The execution of an arithmetic instruction may involve data transfer opera- 
tions to position operands for input to the ALU, and to deliver the output of the 
ALU. Figure 3.5 illustrates the movements involved in both data transfer and arith- 
metic operations. In addition, of course, the ALU portion of the processor performs 
the desired operation. 


Logical 


Most machines also provide a variety of operations for manipulating individual bits 
of a word or other addressable units, often referred to as “bit twiddling.” They are 
based upon Boolean operations (see Chapter 20). 

Some of the basic logical operations that can be performed on Boolean or bi- 
nary data are shown in Table 10.6. The NOT operation inverts a bit. AND, OR, and 
Exclusive-OR (XOR) are the most common logical functions with two operands. 
EQUAL is a useful binary test. 

These logical operations can be applied bitwise to n-bit logical data units. Thus, 
if two registers contain the data 


(R1) = 10100101 
(R2) = 00001111 


then 
(R1) AND (R2) = 00000101 


where the notation (X) means the contents of location X. Thus, the AND operation 
can be used as a mask that selects certain bits in a word and zeros out the remaining 
bits. As another example, if two registers contain 


(R1) = 10100101 
(R2) = 11111111 


then 
(R1) XOR (R2) = 01011010 


With one word set to all 1s, the XOR operation inverts all of the bits in the other 
word (ones complement). 

In addition to bitwise logical operations, most machines provide a variety of shift- 
ing and rotating functions. The most basic operations are illustrated in Figure 10.6. With 
a logical shift, the bits of a word are shifted left or right. On one end, the bit shifted out 
is lost. On the other end, a 0 is shifted in. Logical shifts are useful primarily for isolating 
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(a) Logical right shift 
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(b) Logical left shift 
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(c) Arithmetic right shift 


S eee 


(d) Arithmetic left shift 


——, EVAN 


(e) Right rotate 
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(f) Left rotate 


Figure 10.6 Shift and Rotate Operations 


fields within a word. The Os that are shifted into a word displace unwanted information 
that is shifted off the other end. 

As an example, suppose we wish to transmit characters of data to an I/O de- 
vice 1 character at a time. If each memory word is 16 bits in length and contains two 
characters, we must unpack the characters before they can be sent. To send the two 
characters in a word, 


1. Load the word into a register. 


2. Shift to the right eight times. This shifts the remaining character to the right 
half of the register. 


3. Perform I/O. The I/O module reads the lower-order 8 bits from the data bus. 
The preceding steps result in sending the left-hand character. To send the right- 
hand character, 

1. Load the word again into the register. 

2. AND with 0000000011111111. This masks out the character on the left. 

3. Perform I/O. 
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The arithmetic shift operation treats the data as a signed integer and does not 
shift the sign bit. On a right arithmetic shift, the sign bit is replicated into the bit po- 
sition to its right. On a left arithmetic shift, a logical left shift is performed on all bits 
but the sign bit, which is retained. These operations can speed up certain arithmetic 
operations. With numbers in twos complement notation, a right arithmetic shift cor- 
responds to a division by 2, with truncation for odd numbers. Both an arithmetic left 
shift and a logical left shift correspond to a multiplication by 2 when there is no 
overflow. If overflow occurs, arithmetic and logical left shift operations produce dif- 
ferent results, but the arithmetic left shift retains the sign of the number. Because of 
the potential for overflow, many processors do not include this instruction, including 
PowerPC and Itanium. Others, such as the IBM EAS/390, do offer the instruction. 
Curiously, the x86 architecture includes an arithmetic left shift but defines it to be 
identical to a logical left shift. 

Rotate, or cyclic shift, operations preserve all of the bits being operated on. 
One use of a rotate is to bring each bit successively into the leftmost bit, where it can 
be identified by testing the sign of the data (treated as a number). 

As with arithmetic operations, logical operations involve ALU activity and 
may involve data transfer operations. Table 10.7 gives examples of all of the shift 
and rotate operations discussed in this subsection. 


Conversion 


Conversion instructions are those that change the format or operate on the format 
of data. An example is converting from decimal to binary. An example of a more 
complex editing instruction is the EAS/390 Translate (TR) instruction. This instruc- 
tion can be used to convert from one 8-bit code to another, and it takes three 
operands: 


TR RI (L), R2 


The operand R2 contains the address of the start of a table of 8-bit codes. The L 
bytes starting at the address specified in R1 are translated, each byte being replaced 
by the contents of a table entry indexed by that byte. For example, to translate from 
EBCDIC to IRA, we first create a 256-byte table in storage locations, say, 1000- 
10FF hexadecimal. The table contains the characters of the IRA code in the se- 
quence of the binary representation of the EBCDIC code; that is, the IRA code is 
placed in the table at the relative location equal to the binary value of the EBCDIC 


Table 10.7 Examples of Shift and Rotate Operations 


Operation 


10100110 Logical right shift (3 bits) 00010100 
10100110 Logical left shift (3 bits) 00110000 
10100110 Arithmetic right shift (3 bits) 11110100 
10100110 Arithmetic left shift (3 bits) 10110000 
10100110 Right rotate (3 bits) 11010100 
10100110 Left rotate (3 bits) 00110101 
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code of the same character. Thus, locations 10F0 through 10F9 will contain the val- 
ues 30 through 39, because FO is the EBCDIC code for the digit 0, and 30 is the IRA 
code for the digit 0, and so on through digit 9. Now suppose we have the EBCDIC 
for the digits 1984 starting at location 2100 and we wish to translate to IRA. Assume 
the following: 


e Locations 2100-2103 contain F1 F9 F8 F4. 
e R1 contains 2100. 
e R2 contains 1000. 


Then, if we execute 
TR R1 (4), R2 
locations 2100-2103 will contain 31 39 38 34. 


Input/Output 


Input/output instructions were discussed in some detail in Chapter 7. As we saw, 
there are a variety of approaches taken, including isolated programmed I/O, memory- 
mapped programmed I/O, DMA, and the use of an I/O processor. Many implemen- 
tations provide only a few I/O instructions, with the specific actions specified by 
parameters, codes, or command words. 


System Control 


System control instructions are those that can be executed only while the processor 
is in a certain privileged state or is executing a program in a special privileged area 
of memory. Typically, these instructions are reserved for the use of the operating 
system. 

Some examples of system control operations are as follows. A system control 
instruction may read or alter a control register; we discuss control registers in Chap- 
ter 12. Another example is an instruction to read or modify a storage protection key, 
such as is used in the EAS/390 memory system. Another example is access to 
process control blocks in a multiprogramming system. 


Transfer of Control 


For all of the operation types discussed so far, the next instruction to be performed 
is the one that immediately follows, in memory, the current instruction. However, a 
significant fraction of the instructions in any program have as their function chang- 
ing the sequence of instruction execution. For these instructions, the operation per- 
formed by the processor is to update the program counter to contain the address of 
some instruction in memory. 

There are a number of reasons why transfer-of-control operations are required. 
Among the most important are the following: 


1. In the practical use of computers, it is essential to be able to execute each in- 
struction more than once and perhaps many thousands of times. It may require 
thousands or perhaps millions of instructions to implement an application. 


370 CHAPTER 10 / INSTRUCTION SETS: CHARACTERISTICS AND FUNCTIONS 


This would be unthinkable if each instruction had to be written out separately. 
Ifa table or a list of items is to be processed, a program loop is needed. One se- 
quence of instructions is executed repeatedly to process all the data. 


nN 


Virtually all programs involve some decision making. We would like the com- 
puter to do one thing if one condition holds, and another thing if another con- 
dition holds. For example, a sequence of instructions computes the square root 
of a number. At the start of the sequence, the sign of the number is tested. If 
the number is negative, the computation is not performed, but an error condi- 
tion is reported. 


3. To compose correctly a large or even medium-size computer program is an ex- 
ceedingly difficult task. It helps if there are mechanisms for breaking the task 
up into smaller pieces that can be worked on one at a time. 


We now turn to a discussion of the most common transfer-of-control opera- 
tions found in instruction sets: branch, skip, and procedure call. 


BRANCH INSTRUCTIONS A branch instruction, also called a jump instruction, has 
as one of its operands the address of the next instruction to be executed. Most often, 
the instruction is a conditional branch instruction. That is, the branch is made (up- 
date program counter to equal address specified in operand) only if a certain condi- 
tion is met. Otherwise, the next instruction in sequence is executed (increment 
program counter as usual). A branch instruction in which the branch is always taken 
is an unconditional branch. 

There are two common ways of generating the condition to be tested in a con- 
ditional branch instruction. First, most machines provide a 1-bit or multiple-bit con- 
dition code that is set as the result of some operations. This code can be thought of 
as a short user-visible register. As an example, an arithmetic operation (ADD, 
SUBTRACT, and so on) could set a 2-bit condition code with one of the following 
four values: 0, positive, negative, overflow. On such a machine, there could be four 
different conditional branch instructions: 


BRP X_ Branch to location X if result is positive. 
BRN X_ Branch to location X if result is negative. 
BRZX_ Branch to location X if result is zero. 
BRO X_ Branch to location X if overflow occurs. 


In all of these cases, the result referred to is the result of the most recent oper- 
ation that set the condition code. 

Another approach that can be used with a three-address instruction format is 
to perform a comparison and specify a branch in the same instruction. For example, 


BRE R1, R2,X Branch to X if contents of Rl = contents of R2. 


Figure 10.7 shows examples of these operations. Note that a branch can be ei- 
ther forward (an instruction with a higher address) or backward (lower address). 
The example shows how an unconditional and a conditional branch can be used to 
create a repeating loop of instructions. The instructions in locations 202 through 210 
will be executed repeatedly until the result of subtracting Y from X is 0. 
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address Instruction 
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225 BRE RI, R2, 235 
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e : branch 
235 e 


Figure 10.7 Branch Instructions 


SKIP INSTRUCTIONS Another form of transfer-of-control instruction is the skip in- 
struction. The skip instruction includes an implied address. Typically, the skip implies 
that one instruction be skipped; thus, the implied address equals the address of the 
next instruction plus one instruction length. 

Because the skip instruction does not require a destination address field, it is 
free to do other things. A typical example is the increment-and-skip-if-zero (ISZ) 
instruction. Consider the following program fragment: 


301 
309 ISZ RI 
310 BR 301 
311 


In this fragment, the two transfer-of-control instructions are used to implement an 
iterative loop. R1 is set with the negative of the number of iterations to be per- 
formed. At the end of the loop, R1 is incremented. If it is not 0, the program branches 
back to the beginning of the loop. Otherwise, the branch is skipped, and the program 
continues with the next instruction after the end of the loop. 


PROCEDURE CALL INSTRUCTIONS Perhaps the most important innovation in the 
development of programming languages is the procedure. A procedure is a self- 
contained computer program that is incorporated into a larger program. At any 
point in the program the procedure may be invoked, or called. The processor is in- 
structed to go and execute the entire procedure and then return to the point from 
which the call took place. 

The two principal reasons for the use of procedures are economy and modu- 
larity. A procedure allows the same piece of code to be used many times. This is im- 
portant for economy in programming effort and for making the most efficient use of 
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Figure 10.8 Nested Procedures 


storage space in the system (the program must be stored). Procedures also allow 
large programming tasks to be subdivided into smaller units. This use of modularity 
greatly eases the programming task. 

The procedure mechanism involves two basic instructions: a call instruction 
that branches from the present location to the procedure, and a return instruction 
that returns from the procedure to the place from which it was called. Both of these 
are forms of branching instructions. 

Figure 10.8a illustrates the use of procedures to construct a program. In this ex- 
ample, there is a main program starting at location 4000. This program includes a call 
to procedure PROCI, starting at location 4500. When this call instruction is encoun- 
tered, the processor suspends execution of the main program and begins execution of 
PROC1 by fetching the next instruction from location 4500. Within PROC1, there 
are two calls to PROC2 at location 4800. In each case, the execution of PROC1 is sus- 
pended and PROC72 is executed. The RETURN statement causes the processor to go 
back to the calling program and continue execution at the instruction after the corre- 
sponding CALL instruction. This behavior is illustrated in Figure 10.8b. 

Three points are worth noting: 


1. A procedure can be called from more than one location. 


2. A procedure call can appear in a procedure. This allows the nesting of proce- 
dures to an arbitrary depth. 


3. Each procedure call is matched by a return in the called program. 
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Because we would like to be able to call a procedure from a variety of points, 
the processor must somehow save the return address so that the return can take 
place appropriately. There are three common places for storing the return address: 


e Register 
e Start of called procedure 
e Top of stack 


Consider a machine-language instruction CALL X, which stands for call procedure 
at location X. If the register approach is used, CALL X causes the following actions: 


RN <— PC +A 
PC <— X 


where RN is a register that is always used for this purpose, PC is the program 
counter, and A is the instruction length. The called procedure can now save the con- 
tents of RN to be used for the later return. 

A second possibility is to store the return address at the start of the procedure. 
In this case, CALL X causes 


X << PC+A 
PO — XT 


This is quite handy. The return address has been stored safely away. 

Both of the preceding approaches work and have been used. The only limita- 
tion of these approaches is that they complicate the use of reentrant procedures. A 
reentrant procedure is one in which it is possible to have several calls open to it at 
the same time. A recursive procedure (one that calls itself) is an example of the use 
of this feature (see Appendix H). If parameters are passed via registers or memory 
for a reentrant procedure, some code must be responsible for saving the parameters 
so that the registers or memory space are available for other procedure calls. 

A more general and powerful approach is to use a stack (see Appendix 10A 
for a discussion of stacks). When the processor executes a call, it places the return 
address on the stack. When it executes a return, it uses the address on the stack. 
Figure 10.9 illustrates the use of the stack. 

In addition to providing a return address, it is also often necessary to pass pa- 
rameters with a procedure call. These can be passed in registers. Another possibility 
is to store the parameters in memory just after the CALL instruction. In this case, 


4601 4651 
4101 4101 4101 4101 4101 
e e e e e e e 
(a) Initial stack (b) After (c) Initial (d) After (e) After (f) After (g) After 
contents CALL Procl CALL Proc2 RETURN CALL Proc2 RETURN RETURN 


Figure 10.9 Use of Stack to Implement Nested Subroutines of Figure 10.8 
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Figure 10.10 Stack Frame Growth Using Sample Procedures P and Q 


the return must be to the location following the parameters. Again, both of these ap- 
proaches have drawbacks. If registers are used, the called program and the calling 
program must be written to assure that the registers are used properly. The storing 
of parameters in memory makes it difficult to exchange a variable number of para- 
meters. Both approaches prevent the use of reentrant procedures. 

A more flexible approach to parameter passing is the stack. When the proces- 
sor executes a call, it not only stacks the return address, it stacks parameters to be 
passed to the called procedure. The called procedure can access the parameters 
from the stack. Upon return, return parameters can also be placed on the stack. The 
entire set of parameters, including return address, that is stored for a procedure in- 
vocation is referred to as a stack frame. 

An example is provided in Figure 10.10. The example refers to procedure P in 
which the local variables x1 and x2 are declared, and procedure Q, which P can call 
and in which the local variables yl and y2 are declared. In this figure, the return 
point for each procedure is the first item stored in the corresponding stack frame. 
Next is stored a pointer to the beginning of the previous frame. This is needed if the 
number or length of parameters to be stacked is variable. 
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x86 Operation Types 


The x86 provides a complex array of operation types, including a number of special- 
ized instructions. The intent was to provide tools for the compiler writer to produce 
optimized machine language translation of high-level language programs. Table 10.8 
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x86 Operation Types (with Examples of Typical Operations) 


Instruction 


Description 


Data Movement 


Move operand, between registers or between register and memory. 


Push operand onto stack. 


Push all registers on stack. 


Move byte, word, dword, sign extended. Moves a byte to a word or a word to a 
doubleword with twos-complement sign extension. 


Load effective address. Loads the offset of the source operand, rather than its value to 
the destination operand. 


Table lookup translation. Replaces a byte in AL with a byte from a user-coded 
translation table. When XLAT is executed, AL should have an unsigned index 
to the table. XLAT changes the contents of AL from the table index to the table 
entry. 


IN, OUT 


Input, output operand from I/O space. 


Arithmetic 
Add operands. 


Subtract operands. 


Unsigned integer multiplication, with byte, word, or double word operands, and word, 
doubleword, or quadword result. 


Signed divide. 


Logical 


BTS 


AND operands. 


Bit test and set. Operates on a bit field operand. The instruction copies the current 
value of a bit to flag CF and sets the original bit to 1. 


BSF 


Bit scan forward. Scans a word or doubleword for a 1-bit and stores the number of the 
first 1-bit into a register. 


SHL/SHR 


Shift logical left or right. 


SAL/SAR 
ROL/ROR 


Shift arithmetic left or right. 
Rotate left or right. 


SETcc 


Sets a byte to zero or one depending on any of the 16 conditions defined by status 
flags. 


Control Transfer 


JMP 


Unconditional jump. 


CALL 


Transfer control to another location. Before transfer, the address of the instruction 
following the CALL is placed on the stack. 


JE/IZ 


Jump if equal/zero. 


LOOPE/LOOPZ 


Loops if equal/zero. This is a conditional jump using a value stored in register 
ECX. The instruction first decrements ECX before testing ECX for the branch 
condition. 


INT/INTO Interrupt/Interrupt if overflow. Transfer control to an interrupt service routine. 


(continued ) 
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Continued 


Instruction Description 


String Operations 


Move byte, word, dword string. The instruction operates on one element of a 
string, indexed by registers ESI and EDI. After each string operation, the 
registers are automatically incremented or decremented to point to the next 
element of the string. 


Load byte, word, dword of string. 


High-Level Language Support 


Creates a stack frame that can be used to implement the rules of a block-structured 
high-level language. 


Reverses the action of the previous ENTER. 


Check array bounds. Verifies that the value in operand 1 is within lower and upper 
limits. The limits are in two adjacent memory locations referenced by operand 2. An 
interrupt occurs if the value is out of bounds. This instruction is used to check an 
array index. 


Flag Control 


Set Carry flag. 
Load AH register from flags. Copies SF, ZF, AF, PF, and CF bits into A register. 


Segment Register 


Load pointer into DS and another register. 
System Control 
Halt. 


Asserts a hold on shared memory so that the Pentium has exclusive use of it during 
the instruction that immediately follows the LOCK. 


Processor extension escape. An escape code that indicates the succeeding instructions 
are to be executed by a numeric coprocessor that supports high-precision integer and 
floating-point calculations. 


Wait until BUS Y# negated. Suspends Pentium program execution until the processor 
detects that the BUSY pin is inactive, indicating that the numeric coprocessor has 
finished execution. 


Protection 


SGDT Store global descriptor table. 
LSL Load segment limit. Loads a user-specified register with a segment limit. 
VERR/VERW Verify segment for reading/writing. 


Cache Management 


INVD Flushes the internal cache memory. 


WBINVD Flushes the internal cache memory after writing dirty lines to memory. 
INVLPG Invalidates a translation lookaside buffer (TLB) entry. 
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lists the types and gives examples of each. Most of these are the conventional in- 
structions found in most machine instruction sets, but several types of instructions 
are tailored to the x86 architecture and are of particular interest. Appendix A of 
[CARTO06] lists the x86 instructions, together with the operands for each and the ef- 
fect of the instruction on the condition codes. Appendix B of the NASM assembly 
language manual provides a more detailed description of each x86 instruction. Both 
documents are available at this book’s Web site. 


CALL/RETURN INSTRUCTIONS The x86 provides four instructions to support pro- 
cedure call/return: CALL, ENTER, LEAVE, RETURN. It will be instructive to 
look at the support provided by these instructions. Recall from Figure 10.10 that a 
common means of implementing the procedure call/return mechanism is via the use 
of stack frames. When a new procedure is called, the following must be performed 
upon entry to the new procedure: 


e Push the return point on the stack. 

e Push the current frame pointer on the stack. 

e Copy the stack pointer as the new value of the frame pointer. 

e Adjust the stack pointer to allocate a frame. 
The CALL instruction pushes the current instruction pointer value onto the stack 
and causes a jump to the entry point of the procedure by placing the address of the 


entry point in the instruction pointer. In the 8088 and 8086 machines, the typical 
procedure began with the sequence 


PUSH EBP 
MOV EBP, ESP 
SUB ESP, space_for_locals 


where EBP is the frame pointer and ESP is the stack pointer. In the 80286 and later 
machines, the ENTER instruction performs all the aforementioned operations in a 
single instruction. 

The ENTER instruction was added to the instruction set to provide direct sup- 
port for the compiler. The instruction also includes a feature for support of what are 
called nested procedures in languages such as Pascal, COBOL, and Ada (not found 
in C or FORTRAN). It turns out that there are better ways of handling nested pro- 
cedure calls for these languages. Furthermore, although the ENTER instruction 
saves a few bytes of memory compared with the PUSH, MOV, SUB sequence 
(4 bytes versus 6 bytes), it actually takes longer to execute (10 clock cycles versus 
6 clock cycles). Thus, although it may have seemed a good idea to the instruction set 
designers to add this feature, it complicates the implementation of the processor 
while providing little or no benefit. We will see that, in contrast, a RISC approach to 
processor design would avoid complex instructions such as ENTER and might pro- 
duce a more efficient implementation with a sequence of simpler instructions. 


MEMORY MANAGEMENT Another set of specialized instructions deals with memory 
segmentation. These are privileged instructions that can only be executed from the op- 
erating system. They allow local and global segment tables (called descriptor tables) to 
be loaded and read, and for the privilege level of a segment to be checked and altered. 
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Table 10.9 x86 Status Flags 


Status Bit Name Description 


Carry Indicates carrying or borrowing out of the left-most bit position following an 
arithmetic operation. Also modified by some of the shift and rotate operations. 


Parity Parity of the least-significant byte of the result of an arithmetic or logic oper- 
ation. | indicates even parity; 0 indicates odd parity. 


Auxiliary Carry | Represents carrying or borrowing between half-bytes of an 8-bit arithmetic or| 
logic operation. Used in binary-coded decimal arithmetic. 


Zero Indicates that the result of an arithmetic or logic operation is 0. 


Sign Indicates the sign of the result of an arithmetic or logic operation. 


Overflow Indicates an arithmetic overflow after an addition or subtraction for twos 
complement arithmetic. 


The special instructions for dealing with the on-chip cache were discussed in 
Chapter 4. 


STATUS FLAGS AND CONDITION CODES Status flags are bits in special registers that 
may be set by certain operations and used in conditional branch instructions. The term 
condition code refers to the settings of one or more status flags. In the x86 and many 
other architectures, status flags are set by arithmetic and compare operations. The 
compare operation in most languages subtracts two operands, as does a subtract oper- 
ation. The difference is that a compare operation only sets status flags, whereas a sub- 
tract operation also stores the result of the subtraction in the destination operand. 
Some architectures also set status flags for data transfer instructions. 

Table 10.9 lists the status flags used on the x86. Each flag, or combinations of these 
flags, can be tested for a conditional jump. Table 10.10 shows the condition codes (com- 
binations of status flag values) for which conditional jump opcodes have been defined. 

Several interesting observations can be made about this list. First, we may wish 
to test two operands to determine if one number is bigger than another. But this will 
depend on whether the numbers are signed or unsigned. For example, the 8-bit 
number 11111111 is bigger than 00000000 if the two numbers are interpreted as un- 
signed integers (255 > 0) but is less if they are considered as 8-bit twos complement 
numbers (—1 < 0). Many assembly languages therefore introduce two sets of terms 
to distinguish the two cases: If we are comparing two numbers as signed integers, we 
use the terms Jess than and greater than; if we are comparing them as unsigned inte- 
gers, we use the terms below and above. 

A second observation concerns the complexity of comparing signed integers. 
A signed result is greater than or equal to zero if (1) the sign bit is zero and there is 
no overflow (S = 0 AND O = 0), or (2) the sign bit is one and there is an overflow. 
A study of Figure 9.4 should convince you that the conditions tested for the various 
signed operations are appropriate. 


X86 SIMD INSTRUCTIONS In 1996, Intel introduced MMX technology into its Pen- 
tium product line. MMX is set of highly optimized instructions for multimedia tasks. 
There are 57 new instructions that treat data in a SIMD (single-instruction, multiple- 
data) fashion, which makes it possible to perform the same operation, such as addition 
or multiplication, on multiple data elements at once. Each instruction typically takes a 
single clock cycle to execute. For the proper application, these fast parallel operations 


Table 10.10 


Symbol 
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x86 Condition Codes for Conditional Jump and SETcc Instructions 


Condition Tested 


Comment 


A, NBE 


C=0 AND Z=0 


Above; Not below or equal (greater than, unsigned) 


AE, NB, NC 


C=0 


Above or equal; Not below (greater than or equal, 
unsigned); Not carry 


B, NAE, C 


Below; Not above or equal (less than, unsigned); 
Carry set 


Below or equal; Not above (less than or equal, 
unsigned) 


Z=1 


Equal; Zero (signed or unsigned) 


[(S=1 AND O=1) OR (S=0 
and O=0)] AND [Z=0] 


Greater than; Not less than or equal (signed) 


(S=1 AND O=1) OR (S=0 
AND O=0) 


Greater than or equal; Not less than (signed) 


(S=1 AND O=0) OR (S=0 
AND O=1) 


Less than; Not greater than or equal (signed) 


(S=1 AND O=0) OR (S=0 
AND O=1) OR (Z=1) 


Less than or equal; Not greater than (signed) 


Z=0 


Not equal; Not zero (signed or unsigned) 


O=0 


No overflow 


S=0 Not sign (not negative) 


P=0 Not parity; Parity odd 


Overflow 


Parity; Parity even 


Sign (negative) 


can yield a speedup of two to eight times over comparable algorithms that do not use 
the MMX instructions [ATKI96]. With the introduction of 64-bit x86 architecture, 
Intel has expanded this extension to include double quadword (128 bits) operands 
and floating-point operations. In this subsection, we describe the MMX features. 

The focus of MMX is multimedia programming. Video and audio data are typ- 
ically composed of large arrays of small data types, such as 8 or 16 bits, whereas con- 
ventional instructions are tailored to operate on 32- or 64-bit data. Here are some 
examples: In graphics and video, a single scene consists of an array of pixels,” and 
there are 8 bits for each pixel or 8 bits for each pixel color component (red, green, 
blue). Typical audio samples are quantized using 16 bits. For some 3D graphics algo- 
rithms, 32 bits are common for basic data types. To provide for parallel operation on 
these data lengths, three new data types are defined in MMX. Each data type is 
64 bits in length and consists of multiple smaller data fields, each of which holds a 
fixed-point integer. The types are as follows: 


e Packed byte: Eight bytes packed into one 64-bit quantity 
e Packed word: Four 16-bit words packed into 64 bits 
e Packed doubleword: Two 32-bit doublewords packed into 64 bits 


2A pixel, or picture element, is the smallest element of a digital image that can be assigned a gray level. 
Equivalently, a pixel is an individual dot in a dot-matrix representation of a picture. 


Category 


MM<xX< Instruction Set 


Instruction 


Description 


Arithmetic 


Comparison 


PADD [B, W, D] 


Parallel add of packed eight bytes, four 16-bit words, or two 32-bit 
doublewords, with wraparound. 


PADDS [B, W] 


Add with saturation. 


PADDUS [B, W] 
PSUB [B, W, D] 


Add unsigned with saturation. 


Subtract with wraparound. 


PSUBS [B, W] 


Subtract with saturation. 


PSUBUS [B, W] 


Subtract unsigned with saturation. 


PMULHW 


Parallel multiply of four signed 16-bit words, with high-order 16 bits 
of 32-bit result chosen. 


PMULLW 


Parallel multiply of four signed 16-bit words, with low-order 16 bits 
of 32-bit result chosen. 


PMADDWD 


PCMPEQ [B, W, D] 


Parallel multiply of four signed 16-bit words; add together adjacent 
pairs of 32-bit results. 


Parallel compare for equality; result is mask of 1s if true or Os if 
false. 


PCMPGT [B, W, D] 


Parallel compare for greater than; result is mask of 1s if true or Os 
if false. 


Conversion 


PACKUSWB 


Pack words into bytes with unsigned saturation. 


PACKSS [WB, DW] 


Pack words into bytes, or doublewords into words, with signed 
saturation. 


PUNPCKH [BW, 
WD, DQ] 


Parallel unpack (interleaved merge) high-order bytes, words, or 
doublewords from MMxX register. 


PUNPCKL [BW, 
WD, DQ] 


Parallel unpack (interleaved merge) low-order bytes, words, or 
doublewords from MMxX register. 


Logical 


PAND 


64-bit bitwise logical AND 


PNDN 


64-bit bitwise logical AND NOT 


POR 


64-bit bitwise logical OR 


PXOR 
PSLL [W, D, Q] 


64-bit bitwise logical XOR 


Parallel logical left shift of packed words, doublewords, or 
quadword by amount specified in MMX register or immediate 
value. 


PSRL [W, D, Q] 


Parallel logical right shift of packed words, doublewords, or 
quadword. 


PSRA [W, D] 


Parallel arithmetic right shift of packed words, doublewords, or 
quadword. 


Data Transfer 


MOV [D, Q] 


Move doubleword or quadword to/from MMX register. 


State Mgt 


EMMS 


Empty MMxX< state (empty FP registers tag bits). 


Note: If an instruction supports multiple data types [byte (B), word (W), doubleword (D), quadword (Q)], the 
data types are indicated in brackets. 


Table 10.11 lists the MMX instruction set. Most of the instructions involve parallel 
operation on bytes, words, or doublewords. For example, the PSLLW instruction per- 
forms a left logical shift separately on each of the four words in the packed word 
operand; the PADDB instruction takes packed byte operands as input and performs 
parallel additions on each byte position independently to produce a packed byte output. 


10.5 / INTEL x86 AND ARM OPERATION TYPES 381 


One unusual feature of the new instruction set is the introduction of saturation 
arithmetic for byte and 16-bit word operands. With ordinary unsigned arithmetic, 
when an operation overflows (i.e., a carry out of the most significant bit), the extra 
bit is truncated. This is referred to as wraparound, because the effect of the trunca- 
tion can be, for example, to produce an addition result that is smaller than the two 
input operands. Consider the addition of the two words, in hexadecimal, FO00h and 
3000h. The sum would be expressed as 


FOOOh = 1111 0000 0000 0000 
+3000h = 0011 0000 0000 0000 
10010 0000 0000 0000 = 2000h 


If the two numbers represented image intensity, then the result of the addition 
is to make the combination of two dark shades turn out to be lighter. This is typi- 
cally not what is intended. With saturation arithmetic, if addition results in over- 
flow or subtraction results in underflow, the result is set to the largest or smallest 
value representable. For the preceding example, with saturation arithmetic, we 
have 


FOOOh = 1111 0000 0000 0000 
+3000h = 0011 0000 0000 0000 
10010 0000 0000 0000 

1111 1111 1111 1111 = FFFFh 


To provide a feel for the use of MMX instructions, we look at an example, 
taken from [PELE97]. A common video application is the fade-out, fade-in effect, in 
which one scene gradually dissolves into another. Two images are combined with a 
weighted average: 


Result_pixel = A_pixel X fade + B_pixel X (1 — fade) 


This calculation is performed on each pixel position in A and B. If a series of 
video frames is produced while gradually changing the fade value from 1 to 0 
(scaled appropriately for an 8-bit integer), the result is to fade from image A to 
image B. 

Figure 10.11 shows the sequence of steps required for one set of pixels. The 
8-bit pixel components are converted to 16-bit elements to accommodate the 
MMxX 16-bit multiply capability. If these images use 640 x 480 resolution, and 
the dissolve technique uses all 255 possible values of the fade value, then the total 
number of instructions executed using MMX is 535 million. The same calcula- 
tion, performed without the MMX instructions, requires 1.4 billion instruction 
executions [INTE98]. 


ARM Operation Types 


The ARM architecture provides a large collection of operation types. The following 
are the principal categories: 


e Load and store instructions: In the ARM architecture, only load and store in- 
structions access memory locations; arithmetic and logical instructions are 
performed only on registers and immediate values encoded in the instruction. 
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Alpha Alpha 
B B 
G G 
Image A Image A 
Ar3[Ar2|Arl [ArO Br3[Br2|Br1 [BrO 
1. Unpack byte R pixel Ar3 | Ar2 | Arl [| Aro | Br3_ | Br2 | Bri | Bro 


components from 


images A and B ss es 
Subtract 


2. Subtract image B from image A 13 12 rl 10 
x x x x 
3. Multiply result by fade value fade | fade fade fade 


fade X1r3|fade X12) fade Xr1|fade X10 


4. Add image B pixels Br3 Br2 Brl BrO 


5. Pack new composit pixels 


back to bytes 
r3_ | 12 | rl | 10 
MMxX code sequence performing this operation: 
pxor mm7, mm7 3zero out mm7 
movq mm3, fad_val _ ;load fade value replicated 4 times 
movd mm0, imageA _ ;load 4 red pixel components from image A 
movd mm1, imageB _ ;load 4 red pixel components from image B 
punpckblw mm0, mm7 sunpack 4 pixels to 16 bits 
punpckblw mm1, mm7 sunpack 4 pixels to 16 bits 
psubw mm0, mm1 ;subtract image B from image A 
pmulhw mm0, mm3 smultiply the subtract result by fade values 
padddw mm0, mm1 ;add result to image B 
packuswb mm0, mm7 spack 16-bit results back to bytes 


Figure 10.11 Image Compositing on Color Plane Representation 


This limitation is characteristic of RISC design and it is explored further in 
Chapter 13. The ARM architecture supports two broad types of instruction 
that load or store the value of a single register, or a pair of registers, from or to 
memory: (1) load or store a 32-bit word or an 8-bit unsigned byte, and (2) load 
or store a 16-bit unsigned halfword, and load and sign extend a 16-bit halfword 
or an 8-bit byte 


e Branch instructions: ARM supports a branch instruction that allows a condi- 
tional branch forwards or backwards up to 32 MB. As the program counter is 
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one of the general-purpose registers (R15), a branch or jump can also be 
generated by writing a value to R15. A subroutine call can be performed by 
a variant of the standard branch instruction. As well as allowing a branch 
forward or backward up to 32 MB, the Branch with Link (BL) instruction 
preserves the address of the instruction after the branch (the return address) 
in the LR (R14). Branches are determined by a 4-bit condition field in the 
instruction. 


Data-processing instructions: This category includes logical instructions 
(AND, OR, XOR), add and subtract instructions, and test and compare 
instructions. 


Multiply instructions: The integer multiply instructions operate on word or 
halfword operands and can produce normal or long results. For example, 
there is a multiply instruction that takes two 32-bit operands and produces a 
64-bit result. 


Parallel addition and subtraction instructions: In addition to the normal data 
processing and multiply instructions, there are a set of parallel addition and 
subtraction instructions, in which portions of two operands are operated on 
in parallel. For example, ADD16 adds the top halfwords of two registers to 
form the top halfword of the result and adds the bottom halfwords of the 
same two registers to form the bottom halfword of the result. These instruc- 
tions are useful in image processing applications, similar to the x86 MMX 
instructions 


Extend instructions: There are several instructions for unpacking data by sign or 
zero extending bytes to halfwords or words, and halfwords to words. 


Status register access instructions: ARM provides the ability to read and also 
to write portions of the status register. 


CONDITION CODES The ARM architecture defines four condition flags that are 
stored in the program status register: N, Z, C, and V (Negative, Zero, Carry and 
oVerflow), with meanings essentially the same as the S, Z, C, and V flags in 
the x86 architecture. These four flags constitute a condition code in ARM. 
Table 10.12 shows the combination of conditions for which conditional execution 
is defined. 


nN 


There are two unusual aspects to the use of condition codes in ARM: 


. All instructions, not just branch instructions, include a condition code field, 


which means that virtually all instructions may be conditionally executed. 
Any combination of flag settings except 1110 or 1111 in an instruction’s con- 
dition code field signifies that the instruction will be executed only if the con- 
dition is met. 


. All data processing instructions (arithmetic, logical) include an S bit that signi- 


fies whether the instruction updates the condition flags. 


The use of conditional execution and conditional setting of the condition flags 


helps in the design of shorter programs that use less memory. On the other hand, all 
instructions include 4 bits for the condition code, so there is a trade-off in that fewer 
bits in the 32-bit instruction are available for opcode and operands. Because the 
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Table 10.12 ARM Conditions for Conditional Instruction Execution 


Condition Tested Comment 


Z=1 Equal 
Z=0 Not equal 


C=1 Carry set/unsigned higher or same 


C=0 Carry clear/unsigned lower 


N=1 Minus/negative 


N=0 Plus/positive or zero 
V=1 Overflow 

V=0 No overflow 

C=1 AND Z=0 Unsigned higher 

C=0 OR Z=1 Unsigned lower or same 


N=V Signed greater than or equal 
[(N=1 AND V=1) 
OR (N=0 AND V=0) 
N#V Signed less than 
[(N=1 AND V=0) 

OR (N=0 AND V=1)] 
(Z=0) AND (N=V) Signed greater than 
(Z=1) OR (N#V) Signed less than or equal 


_ Always (unconditional) 


_ This instruction can only be executed 
unconditionally 


ARM is a RISC design that relies heavily on register addressing, this seems to be a 
reasonable trade-off. 


10.6 RECOMMENDED READING 


The x86 instruction set is well covered by [BREY09]. The ARM instruction set is covered 
in [SLOS04] and [KNAG04]. [INTE04b] describes software considerations related to 
microprocessor Endian architecture and discusses guidelines for developing Endian- 
neutral code. 


BREY09 Brey, B. The Intel Microprocessors: 8086/8066, 80186/80188, 80286, 80386, 
80486, Pentium, Pentium Pro Processor, Pentium II, Pentium IL, Pentium 4 and 
Core2 with 64-bit Extensions. Upper Saddle River, NJ: Prentice Hall, 2009. 


INTE04b Intel Corp. Endianness White Paper. November 15, 2004. 


KNAG04 Knages, P, and Welsh, S. ARM: Assembly Language Programming. 
Bournemouth University, School of Design, Engineering, and Computing, August 
31, 2004. www.freetechbooks.com/arm-assembly-language-programming-t729.html 

SLOS04 Sloss, A.; Symes, D.; and Wright, C. ARM System Developer's Guide. San 
Francisco: Morgan Kaufmann, 2004. 
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Key Terms 


accumulator jump procedure call 
address little endian procedure return 
arithmetic shift logical shift push 

bi-endian machine instruction reentrant procedure 


big endian operand reverse Polish notation 
branch operation rotate 

conditional branch packed decimal skip 

instruction set pop stack 


Review Questions 


10.1. What are the typical elements of a machine instruction? 
10.2. What types of locations can hold source and destination operands? 
10.3. If an instruction contains four addresses, what might be the purpose of each address? 
10.4 List and briefly explain five important instruction set design issues. 
10.5 What types of operands are typical in machine instruction sets? 
10.6 What is the relationship between the IRA character code and the packed decimal 
representation? 
10.7. What is the difference between an arithmetic shift and a logical shift? 
10.8 Why are transfer of control instructions needed? 
10.9 List and briefly explain two common ways of generating the condition to be tested in 
a conditional branch instruction. 
10.10 What is meant by the term nesting of procedures? 
10.11 List three possible places for storing the return address for a procedure return. 
10.12 What is a reentrant procedure? 
10.13. What is reverse Polish notation? 
10.14 What is the difference between big endian and little endian? 


Problems 


10.1 Show in hex notation: 
a. The packed decimal format for 23 
b. The ASCII characters 23 

10.2 For each of the following packed decimal numbers, show the decimal value: 
a. 01110011 0000 1001 
b. 0101 1000 0010 
c. 0100 1010 0110 

10.3. A given microprocessor has words of 1 byte. What is the smallest and largest integer 
that can be represented in the following representations: 

Unsigned 

Sign-magnitude 

Ones complement 

Twos complement 

Unsigned packed decimal 

Signed packed decimal 


me Ro se 
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10.4 


10.5 


10.6 


10.7 


10.8 


10.9 


10.10 


Many processors provide logic for performing arithmetic on packed decimal num- 
bers. Although the rules for decimal arithmetic are similar to those for binary opera- 
tions, the decimal results may require some corrections to the individual digits if 
binary logic is used. 

Consider the decimal addition of two unsigned numbers. If each number con- 

sists of N digits, then there are 4N bits in each number. The two numbers are to be 
added using a binary adder. Suggest a simple rule for correcting the result. Perform 
addition in this fashion on the numbers 1698 and 1786. 
The tens complement of the decimal number_X is defined to be 10 — X, where N is 
the number of decimal digits in the number. Describe the use of ten’s complement 
representation to perform decimal subtraction. Illustrate the procedure by subtract- 
ing (0326),9 from (0736)o. 


Compare zero-, one-, two-, and three-address machines by writing programs to compute 
X=(A+BxXC)/(D-EXF) 


for each of the four machines. The instructions available for use are as follows: 


0 Address 1 Address 2 Address 3 Address 
PUSH M LOAD M MOVE (X< Y) MOVE (X< Y) 
POPM STORE M ADD (X< X + Y) ADD (X<Y + Z) 
ADD ADD M SUB (X< X — Y) SUB (X< Y — Z) 
SUB SUB M MUL (X< X X Y) MUL (X<Y X Z) 
MUL MULM DIV (X= X/Y) DIV (X< Y/Z) 
DIV DIVM 


Consider a hypothetical computer with an instruction set of only two n-bit instruc- 
tions. The first bit specifies the opcode, and the remaining bits specify one of the 2”~! 
n-bit words of main memory. The two instructions are as follows: 


SUBS X Subtract the contents of location X from the accumulator, and store the 
result in location X and the accumulator. 
JUMP X Place address X in the program counter. 


A word in main memory may contain either an instruction or a binary number in twos 
complement notation. Demonstrate that this instruction repertoire is reasonably 
complete by specifying how the following operations can be programmed: 

a. Data transfer: Location X to accumulator, accumulator to location X 

b. Addition: Add contents of location X to accumulator 

ec. Conditional branch 

d. Logical OR 

e. I/O Operations 

Many instruction sets contain the instruction NOOP, meaning no operation, which 
has no effect on the processor state other than incrementing the program counter. 
Suggest some uses of this instruction. 

In Section 10.4, it was stated that both an arithmetic left shift and a logical left shift 
correspond to a multiplication by 2 when there is no overflow, and if overflow occurs, 
arithmetic and logical left shift operations produce different results, but the arith- 
metic left shift retains the sign of the number. Demonstrate that these statements are 
true for 5-bit twos complement integers. 

In what way are numbers rounded using arithmetic right shift (e.g., round toward 
+00, round toward —co,toward zero, away from 0)? 


10.11 


10.12 


10.13 


10.14 


10.15 
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Suppose a stack is to be used by the processor to manage procedure calls and returns. 
Can the program counter be eliminated by using the top of the stack as a program 
counter? 

The x86 architecture includes an instruction called Decimal Adjust after Addition 
(DAA). DAA performs the following sequence of instructions: 


if ((AL AND OFH) >9) OR (AF = 1) then 
AL¢AL + 6; 
AF<¢1; 


else 
AF <0; 

endif; 

if (AL > 9FH) OR (CF = 1) then 
AL<AL + 60H; 
CF¢ 1; 


else 
CF <0; 
endif. 


“H” indicates hexadecimal. AL is an 8-bit register that holds the result of addition of 
two unsigned 8-bit integers. AF is a flag set if there is a carry from bit 3 to bit 4 in the 
result of an addition. CF is a flag set if there is a carry from bit 7 to bit 8. Explain the 
function performed by the DAA instruction. 


The x86 Compare instruction (CMP) subtracts the source operand from the destina- 

tion operand; it updates the status flags (C, P, A, Z, S, O) but does not alter either of 

the operands. The CMP instruction can be used to determine if the destination 
operand is greater than, equal to, or less than the source operand. 

a. Suppose the two operands are treated as unsigned integers. Show which status 
flags are relevant to determine the relative size of the two integer and what values 
of the flags correspond to greater than, equal to, or less than. 

b. Suppose the two operands are treated as twos complement signed integers. Show 
which status flags are relevant to determine the relative size of the two integer 
and what values of the flags correspond to greater than, equal to, or less than. 

c. The CMP instruction may be followed by a conditional Jump (Jcc) or Set Condi- 
tion (SETcc) instruction, where cc refers to one of the 16 conditions listed in Table 
10.10. Demonstrate that the conditions tested for a signed number comparison 
are correct. 


Suppose we wished to apply the x86 CMP instruction to 32-bit operands that con- 
tained numbers in a floating-point format. For correct results, what requirements 
have to be met in the following areas? 

a. The relative position of the significand, sign, and exponent fields. 

b. The representation of the value zero. 

c. The representation of the exponent. 

d. Does the IEEE format meet these requirements? Explain. 


Many microprocessor instruction sets include an instruction that tests a condition and 
sets a destination operand if the condition is true. Examples include the SETcc on the 
x86, the Scc on the Motorola MC68000, and the Scond on the National NS32000. 
a. There are a few differences among these instructions: 
e SETcc and Scc operate only on a byte, whereas Scond operates on byte, word, 
and doubleword operands. 
e SETcc and Scond set the operand to integer one if true and to zero if false. Scc 
sets the byte to all binary ones if true and all zeros if false. 
What are the relative advantages and disadvantages of these differences? 
b. None of these instructions set any of the condition code flags, and thus an explicit 
test of the result of the instruction is required to determine its value. Discuss 
whether condition codes should be set as a result of this instruction. 
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10.16 


10.17 


10.18 


10.19 


c. 


d. 


A simple IF statement such as IF a > b THEN can be implemented using a 
numerical representation method, that is, making the Boolean value manifest, as 
opposed to a flow of control method, which represents the value of a Boolean ex- 
pression by a point reached in the program. A compiler might implement IF 
a > b THEN with the following x86 code: 


SUB CX, CX ;set register CX to 0 


MOV AX,B smove contents of location B to register AX 
CMP AX,A ;compare contents of register AX and location A 
JLE TEST jump if A = B 
INC CX sadd 1 to contents of register CX 

TEST JCXZ OUT ;jump if contents of CX equal 0 

THEN OUT 


The result of (A > B) is a Boolean value held in a register and available later on, 

outside the context of the flow of code just shown. It is convenient to use register CX 

for this, because many of the branch and loop opcodes have a built-in test for CX. 
Show an alternative implementation using the SETcc instruction that saves 

memory and execution time. (Hint: No additional new x86 instructions are need- 

ed, other than the SETcc.) 

Now consider the high-level language statement: 


A: = (B > C)OR(D = F) 


A compiler might generate the following code: 


MOV EAX,B smove contents of location B to register EAX 
CMP EAX,C _ ;compare contents of register EAX and location C 
MOV BL,O ;0 represents false 
JLE N1 jump ifB = C 
MOV BL,1 ;1 represents false 
N1 MOV EAX,D 
CMP EAX,F 
MOV BH,0 
JNE N2 
MOV BH,1 
N2 OR BL, BH 


Show an alternative implementation using the SETcc instruction that saves memory 
and execution time. 


Suppose that two registers contain the following hexadecimal values: AB0890C2, 
4598EES0. What is the result of adding them using MMX instructions: 


a. 
b. 


for packed byte 
for packed word 


Assume saturation arithmetic is not used. 


Appendix 10A points out that there are no stack-oriented instructions in an instruc- 
tion set if the stack is to be used only by the processor for such purposes as procedure 
handling. How can the processor use a stack for any purpose without stack-oriented 
instructions? 


Convert the following formulas from reverse Polish to infix: 


a. 
b. 
c. 
d. 


AB+C+Dx 
AB/CD/ + 
ABCDE +X X/ 


ABCDE + F/ + G — H/ x4 


Convert the following formulas from infix to reverse Polish: 


a. 
b. 
c. 
d. 


A+B+C+D+E 
(A+B)X(C+D)+E 
(AX B)+(CXD)+E 
(A — B) x (((C — D X E)/FYG) X H 


10.20 


10.21 


10.22 


10.23 


10.24 


10.25 
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Convert the expression A + B — C to postfix notation using Dijkstra’s algorithm. 
Show the steps involved. Is the result equivalent to (A + B) — Cor A + (B — C)? 
Does it matter? 

Using the algorithm for converting infix to postfix defined in Appendix 10A, show the 
steps involved in converting the expression of Figure 10.15 into postfix. Use a presen- 
tation similar to Figure 10.17. 

Show the calculation of the expression in Figure 10.17, using a presentation similar to 
Figure 10.16. 

Redraw the little-endian layout in Figure 10.18 so that the bytes appear as numbered 
in the big-endian layout. That is, show memory in 64-bit rows, with the bytes listed left 
to right, top to bottom. 

For the following data structures, draw the big-endian and little-endian layouts, using 
the format of Figure 10.18, and comment on the results. 


a. struct { 
double i; //0x1112131415161718 
}s1; 
b. struct { 
int 1; //0x11121314 
int j; //0x15161718 
} s2; 
c. struct { 
short i; //0x1112 
short j; //0x1314 
short k; //0x1516 
short 1; //0x1718 
} s3; 


The IBM Power architecture specification does not dictate how a processor should im- 
plement little-endian mode. It specifies only the view of memory a processor must have 
when operating in little-endian mode. When converting a data structure from big endian 
to little endian, processors are free to implement a true byte-swapping mechanism or to 
use some sort of an address modification mechanism. Current Power processors are all 
default big-endian machines and use address modification to treat data as little-endian. 

Consider the structure s defined in Figure 10.18. The layout in the lower-right por- 
tion of the figure shows the structure s as seen by the processor. In fact, if structure s 
is compiled in little-endian mode, its layout in memory is shown in Figure 10.12. Ex- 
plain the mapping that is involved, describe an easy way to implement the mapping, 
and discuss the effectiveness of this approach. 


Little-endian address mapping 


Byte 
address 11 12 13 14 

00 | 00 01 02 03] 04 05 06 07 
21 22 23 24 25 26 27 28 
0s | 08 09 0A OB OC OD OE OF 
Dc eB A’ | 31 32 33 34 
10 (BROS ae ais) aes aeons 


51 52| | ‘GFE! 
1g | 18 19] 1A 1B] 1C] 1D! 1B! 1F 
61 62 63 64 


20 AD PAL, 8) || AE sy O27 


Figure 10.12 Power Architecture 
Little-Endian Structure s in Memory 
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10.26 Write a small program to determine the endianness of machine and report the results. 
Run the program on a computer available to you and turn in the output. 

10.27 The MIPS processor can be set to operate in either big-endian or little-endian mode. 
Consider the Load Byte Unsigned (LBU) instruction, which loads a byte from mem- 
ory into the low-order 8 bits of a register and fills the high-order 24 bits of the register 
with zeros. The description of LBU is given in the MIPS reference manual using a 
register-transfer language as 


mem < LoadMemory(...) 
byte + VirtualAddress, 9 
if CONDITION then 

GPR[rt] < 0**||mem3,_, X pyte .. 24-8 ™ byte 
else 

GPR[rt] < 0**||mem7 +5 X byte .. 8 X byte 
endif 


where byte refers to the two low-order bits of the effective address and mem refers 
to the value loaded from memory. In the manual, instead of the word CONDI- 
TION, one of the following two words is used: BigEndian, LittleEndian. Which 
word is used? 

10.28 Most, but not all, processors use big- or little-endian bit ordering within a byte that is 
consistent with big- or little-endian ordering of bytes within a multibyte scalar. Let us 
consider the Motorola 68030, which uses big-endian byte ordering. The documenta- 
tion of the 68030 concerning formats is confusing. The user’s manual explains that the 
bit ordering of bit fields is the opposite of bit ordering of integers. Most bit field oper- 
ations operate with one endian ordering, but a few bit field operations require the op- 
posite ordering. The following description from the user’s manual describes most of 
the bit field operations: 


A bit operand is specified by a base address that selects one byte in mem- 
ory (the base byte), and a bit number that selects the one bit in this byte. 
The most significant bit is bit seven. A bit field operand is specified by: 
(1) a base address that selects one byte in memory; (2) a bit field offset 
that indicates the leftmost (base) bit of the bit field in relation to the 
most significant bit of the base byte; and (3) a bit field width that deter- 
mines how many bits to the right of the base byte are in the bit field. The 
most significant bit of the base byte is bit field offset 0, the least signifi- 
cant bit of the base byte is bit field offset 7. 


Do these instructions use big-endian or little-endian bit ordering? 
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Stacks 


A stack is an ordered set of elements, only one of which can be accessed at a time. 
The point of access is called the top of the stack. The number of elements in the 
stack, or length of the stack, is variable. The last element in the stack is the base of 
the stack. Items may only be added to or deleted from the top of the stack. For this 
reason, a stack is also known as a pushdown lisf or a last-in-first-out (LLFO) list. 


3A better term would be place-on-top-of list because the existing elements of the list are not moved in 
memory, but a new element is added at the next available memory address. 
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SP = Stack pointer 
BP = Base pointer 


Figure 10.13 Basic Stack Operation (full/descending) 


Figure 10.13 shows the basic stack operations. We begin at some point in time 
when the stack contains some number of elements. A PUSH operation appends one 
new item to the top of the stack. A POP operation removes the top item from the 
stack. In both cases, the top of the stack moves accordingly. Binary operators, which 
require two operands (e.g., multiply, divide, add, subtract), use the top two stack 
items as operands, pop both items, and push the result back onto the stack. Unary 
operations, which require only one operand (e.g., logical NOT), use the item on the 
top of the stack. All of these operations are summarized in Table 10.13. 


Stack Implementation 


The stack is a useful structure to provide as part of a processor implementation. 
One use, discussed in Section 10.4, is to manage procedure calls and returns. 
Stacks may also be useful to the programmer. An example of this is expression 
evaluation, discussed later in this section. 

The implementation of a stack depends in part on its potential uses. If it is de- 
sired to make stack operations available to the programmer, then the instruction set 
will include stack-oriented operations, including PUSH, POP, and operations that 
use the top one or two stack elements as operands. Because all of these operations 


Table 10.13 Stack-Oriented Operations 


PUSH Append a new element on the top of the stack. 

POP Delete the top element of the stack. 

Unary operation Perform operation on top element of stack. Replace top element with result. 

Binary operation Perform operation on top two elements of stack. Delete top two elements of 
stack. Place result of operation on top of stack. 


392 CHAPTER 10 / INSTRUCTION SETS: CHARACTERISTICS AND FUNCTIONS 


Processor Main 
registers memory 
Stack 
limit 
Stack 
pointer 
n 
o 
n 
ob 
Stack Free | 
base Block 3 
reserved on 
= 
for stack - 
In use 3 
n 
o 
a 


Figure 10.14 Typical Stack Organization (full/descending) 


refer to a unique location, namely the top of the stack, the address of the operand or 
operands is implicit and need not be included in the instruction. These are the zero- 
address instructions referred to in Section 10.1. 

If the stack mechanism is to be used only by the processor, for such purposes 
as procedure handling, then there will not be explicit stack-oriented instructions in 
the instruction set. In either case, the implementation of a stack requires that there 
be some set of locations used to store the stack elements. A typical approach is illus- 
trated in Figure 10.14. A contiguous block of locations is reserved in main memory 
(or virtual memory) for the stack. Most of the time, the block is partially filled with 
stack elements and the remainder is available for stack growth. 

Three addresses are needed for proper operation, and these are often stored in 
processor registers: 


e Stack pointer (SP): Contains the address of the top of the stack. If an item is 
appended to or deleted from the stack, the pointer is incremented or decre- 
mented to contain the address of the new top of the stack. 


e Stack base: Contains the address of the bottom location in the reserved block. 
If an attempt is made to POP when the stack is empty, an error is reported. 


e Stack limit: Contains the address of the other end of the reserved block. If an 
attempt is made to PUSH when the block is fully utilized for the stack, an 
error is reported. 
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Stack implementations have two key attributes: 


e Ascending/descending: An ascending stack grows in the direction of ascending 
addresses, starting from a low address and progressing to a higher address. That is, 
an ascending stack is one in which the SP is incremented when items are pushed 
and decremented when items are pulled. A descending stack grows in the direc- 
tion of descending addresses, starting from a high address and progressing to a 
lower one. Most machines implement descending stacks as a default. 


e Full/empty: This is a misleading terminology, because is does not refer to 
whether the stack is completely full or completely empty. Rather, the SP can 
either point to the top item in the stack (full method), or the next free space on 
the stack (an empty method). For the full method, when the stack is complete- 
ly full, the SP points to the upper limit of the stack. For the empty method, 
when the stack is completely empty, the SP points to the base of the stack. 


Figure 10.13 is an example of a descending/full implementation (assuming that 
numerically lower addresses are depicted higher on the page). The ARM architecture 
allows the system programmer to specify the use of ascending or descending, empty or 
full stack operations. The x86 architecture uses a descending/empty convention. 


Expression Evaluation 


Mathematical formulas are usually expressed in what is known as infix notation. In 
this form, a binary operator appears between the operands (e.g., a + b). For com- 
plex expressions, parentheses are used to determine the order of evaluation of ex- 
pressions. For example, a + (b X c) will yield a different result than (a + b) X c. 
To minimize the use of parentheses, operations have an implied precedence. Gener- 
ally, multiplication takes precedence over addition, so that a + b X c is equivalent 
toa + (b X c). 

An alternative technique is known as reverse Polish, or postfix, notation. In 
this notation, the operator follows its two operands. For example, 


a+b becomes ab + 
a + (b X c) becomes abe X+ 
(a+b) Xc becomes a b + cX 


Note that, regardless of the complexity of an expression, no parentheses are required 
when using reverse Polish. 

The advantage of postfix notation is that an expression in this form is easily 
evaluated using a stack. An expression in postfix notation is scanned from left to 
right. For each element of the expression, the following rules are applied: 


1. If the element is a variable or constant, push it onto the stack. 
2. If the element is an operator, pop the top two items of the stack, perform the 


operation, and push the result. 


After the entire expression has been scanned, the result is on the top of the stack. 
The simplicity of this algorithm makes it a convenient one for evaluating 
expressions. Accordingly, many compilers will take an expression in a high-level 
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Stack General Registers | Single Register 
Push a Load Rl, a Load d 
Push b Subtract R1, b Multiply e 
Subtract Load R2, d Addc 
Push c Multiply R2, e Store f 
Push d Add R2,c Load a 
Push e Divide R1, R2 Subtract b 
Multiply Store R1, f Divide f 
Add Store f 
Divide 
Pop f 
Number of instructions 10 7 8 
Memory access 10 op + 6d Top +6d 8op+ 8d 


Figure 10.15 
— a-b 
c+ (d X e) 


Comparison of Three Programs to Calculate 


language, convert it to postfix notation, and then generate the machine instructions 
from that notation. Figure 10.15 shows the sequence of machine instructions for 
evaluating f = (a — b)/(c + d X e) using stack-oriented instructions. The figure 
also shows the use of one-address and two-address instructions. Note that, even 
though the stack-oriented rules were not used in the last two cases, the postfix nota- 
tion served as a guide for generating the machine instructions. The sequence of 
events for the stack program is shown in Figure 10.16. 

The process of converting an infix expression to a postfix expression is itself 
most easily accomplished using a stack. The following algorithm is due to Dijkstra 


—" d 
>| b >| c c 
| a a —> a-b a-b a-b 
| 
d > dxe 
c c —|(dxe) +c 
a-b a-b a-b _> \ —_> 
a-b 
(dxe) +c 


Figure 10.16 Use of Stack to Compute f = (a — b)/[(d X e) + c] 
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Input Output Bas > 
A+BXC+(D+E) XF empty empty 
+BXC+(D+E)XF A empty 
BXC+(D+E) XF A + 
xX C+(D+E) XF AB + 
C+(D+E) XF AB +X 
+(D+E) XF ABC +X 
(D+E) XF ABCX+ + 
D+E) XF ABCX+ +( 
+E) XF ABCX+D + ( 
E) X F ABCX+D +(+ 
)XF ABCX+DE +(+ 
x F ABCX+DE+ + 
F ABCX+DE+ +X 
empty ABCX+DE+F +X 
empty ABCX+DE+FX+ empty 


Figure 10.17 Conversion of an Expression from Infix to Postfix Notation 
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[DIJK63]. The infix expression is scanned from left to right, and the postfix expres- 
sion is developed and output during the scan. The steps are as follows: 


= Ww 


nN 


. If itis an operand, output it. 


. If it is an operator, then 


. Examine the next element in the input. 


. If it is an opening parenthesis, push it onto the stack. 


e If the top of the stack is an opening parenthesis, then push the operator. 


e If it has higher priority than the top of the stack (multiply and divide have 
higher priority than add and subtract), then push the operator. 


e Else, pop operation from stack to output, and repeat step 4. 


parenthesis is encountered. Pop and discard the opening parenthesis. 


. If there is more input, go to step 1. 


7. If there is no more input, unstack the remaining operands. 


. If it is a closing parenthesis, pop operators to the output until an opening 


Figure 10.17 illustrates the use of this algorithm. This example should give the 
reader some feel for the power of stack-based algorithms. 
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An annoying and curious phenomenon relates to how the bytes within a word and 
the bits within a byte are both referenced and represented. We look first at the prob- 
lem of byte ordering and then consider that of bits. 


Byte Ordering 


The concept of endianness was first discussed in the literature by Cohen [COHE81]. 
With respect to bytes, endianness has to do with the byte ordering of multibyte scalar 
values. The issue is best introduced with an example. Suppose we have the 32-bit 
hexadecimal value 12345678 and that it is stored in a 32-bit word in byte-addressable 
memory at byte location 184. The value consists of 4 bytes, with the least significant 
byte containing the value 78 and the most significant byte containing the value 12. 
There are two obvious ways to store this value: 


Address Value Address Value 
184} 12 184| 78 
185| 34 185| 56 
186| 56 186] 34 
187| 78 187| 12 


The mapping on the left stores the most significant byte in the lowest numerical byte 
address; this is known as big endian and is equivalent to the left-to-right order of 
writing in Western culture languages. The mapping on the right stores the least sig- 
nificant byte in the lowest numerical byte address; this is known as little endian and 
is reminiscent of the right-to-left order of arithmetic operations in arithmetic units.* 
For a given multibyte scalar value, big endian and little endian are byte-reversed 
mappings of each other. 

The concept of endianness arises when it is necessary to treat a multiple-byte 
entity as a single data item with a single address, even though it is composed of 
smaller addressable units. Some machines, such as the Intel 80x86, x86, VAX, and 
Alpha, are little-endian machines, whereas others, such as the IBM System 370/390, 
the Motorola 680x0, Sun SPARC, and most RISC machines, are big endian. This pre- 
sents problems when data are transferred from a machine of one endian type to the 
other and when a programmer attempts to manipulate individual bytes or bits with- 
in a multibyte scalar. 

The property of endianness does not extend beyond an individual data unit. In 
any machine, aggregates such as files, data structures, and arrays are composed of 
multiple data units, each with endianness. Thus, conversion of a block of memory 
from one style of endianness to the other requires knowledge of the data structure. 

Figure 10.18 illustrates how endianness determines addressing and byte order. 
The C structure at the top contains a number of data types. The memory layout in 


“The terms big endian and little endian come from Part I, Chapter 4 of Jonathan Swift’s Gulliver's Travels. 
They refer to a religious war between two groups, one that breaks eggs at the big end and the other that 
breaks eggs at the little end. 


Byte 
address 


00 


08 


10 


18 


20 
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struct { 
int a; //0x1112_1314 word 
int pad; // 
double b; //0X2122_2324_2526_2728 doubleword 
char* ey //0x3132_3334 word 
char CUT SPOR BY Cle Dt SE! Ee eG byte array 
short ey //0x5152 halfword 
int £; //0x6162_ 6364 word 
} s; 
Big-endian address mapping Little-endian address mapping Hea 
a eee eee ee yte 
11 12 13 14 11 12 13 14} address 
00 01 02 03{ 04 05 06 07 07 06 05 04] 03 02 O1 00 00 
21 22 23 24 25 26 27 28 PL 92 BS OM SPL PF 98S 
08 09 OA OB OC OD OE OF OF OE OD OC OB OA 09 08 08 
ail 82 Ry SAN ye ee DY Die Ca Bos Als) hS lie 32 eS Sane Sd) 
TOs de | aes igre 1G ised | Iie eee «= 10 
Gop Ope eer 51 52 51 52 Gu HB 
18) 19: 1A] 1B] 1¢ ID] 1B 1F IF 1E£/1D 1C/1B{ 14:19! 18] 18 
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20 e228 By PP Pil BY 20 


Figure 10.18 Example C Data Structure and its Endian Maps 


the lower left results from compilation of that structure for a big-endian machine, 
and that in the lower right for a little-endian machine. In each case, memory is de- 
picted as a series of 64-bit rows. For the big-endian case, memory typically is viewed 
left to right, top to bottom, whereas for the little-endian case, memory typically is 
viewed as right to left, top to bottom. Note that these layouts are arbitrary. Either 
scheme could use either left to right or right to left within a row; this is a matter of 
depiction, not memory assignment. In fact, in looking at programmer manuals for a 
variety of machines, a bewildering collection of depictions is to be found, even with- 
in the same manual. 
We can make several observations about this data structure: 


e Each data item has the same address in both schemes. For example, the ad- 
dress of the doubleword with hexadecimal value 2122232425262728 is 08. 


e Within any given multibyte scalar value, the ordering of bytes in the little-en- 
dian structure is the reverse of that for the big-endian structure. 


e Endianness does not affect the ordering of data items within a structure. Thus, 
the four-character word c exhibits byte reversal, but the seven-character byte 
array d does not. Hence, the address of each individual element of d is the 
same in both structures. 


The effect of endianness is perhaps more clearly demonstrated when we view 
memory as a Vertical array of bytes, as shown in Figure 10.19. 
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(b) Little endian 


Figure 10.19 Another View of 


Figure 10.18 


There is no general consensus as to which is the superior style of endianness. 


The following points favor the big-endian style: 


5 


e Character-string sorting: A big-endian processor is faster in comparing inte- 
ger-aligned character strings; the integer ALU can compare multiple bytes in 


parallel. 


¢ Decimal/IRA dumps: All values can be printed left to right without causing 


confusion. 


e Consistent order: Big-endian processors store their integers and character 


strings in the same order (most significant byte comes first). 


‘The prophet revered by both groups in the Endian Wars of Gulliver's Travels had this to say. “All true 


Believers shall break their Eggs at the convenient End.” Not much help! 
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The following points favor the little-endian style: 


e A big-endian processor has to perform addition when it converts a 32-bit inte- 
ger address to a 16-bit integer address, to use the least significant bytes. 


e It is easier to perform higher-precision arithmetic with the little-endian style; 
you don’t have to find the least-significant byte and move backward. 


The differences are minor and the choice of endian style is often more a mat- 
ter of accommodating previous machines than anything else. 

The PowerPC is a bi-endian processor that supports both big-endian and little- 
endian modes. The bi-endian architecture enables software developers to choose ei- 
ther mode when migrating operating systems and applications from other machines. 
The operating system establishes the endian mode in which processes execute. Once 
a mode is selected, all subsequent memory loads and stores are determined by the 
memory-addressing model of that mode. To support this hardware feature, 2 bits are 
maintained in the machine state register (MSR) maintained by the operating system 
as part of the process state. One bit specifies the endian mode in which the kernel 
runs; the other specifies the processor’s current operating mode. Thus, mode can be 
changed on a per-process basis. 


Bit Ordering 
In ordering the bits within a byte, we are immediately faced with two questions: 


1. Do you count the first bit as bit zero or as bit one? 


2. Do you assign the lowest bit number to the byte’s least significant bit (little en- 
dian) or to the bytes most significant bit (big endian)? 


These questions are not answered in the same way on all machines. Indeed, on 
some machines, the answers are different in different circumstances. Furthermore, 
the choice of big- or little-endian bit ordering within a byte is not always consistent 
with big- or little-endian ordering of bytes within a multibyte scalar. The program- 
mer needs to be concerned with these issues when manipulating individual bits. 

Another area of concern is when data are transmitted over a bit-serial line. 
When an individual byte is transmitted, does the system transmit the most signifi- 
cant bit first or the least significant bit first? The designer must make certain that in- 
coming bits are handled properly. For a discussion of this issue, see [JAME90]. 
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KEY POINTS 


® Anoperand reference in an instruction either contains the actual value of 
the operand (immediate) or a reference to the address of the operand. A 
wide variety of addressing modes is used in various instruction sets. These 
include direct (operand address is in address field), indirect (address field 
points to a location that contains the operand address), register, register 
indirect, and various forms of displacement, in which a register value is 
added to an address value to produce the operand address. 


® The instruction format defines the layout fields in the instruction. 
Instruction format design is a complex undertaking, including such con- 
sideration as instruction length, fixed or variable length, number of bits 
assigned to opcode and each operand reference, and how addressing 
mode is determined. 


In Chapter 10, we focused on what an instruction set does. Specifically, we examined 
the types of operands and operations that may be specified by machine instructions. 
This chapter turns to the question of how to specify the operands and operations of 
instructions. Two issues arise. First, how is the address of an operand specified, and 
second, how are the bits of an instruction organized to define the operand addresses 
and operation of that instruction? 


11.1 ADDRESSING 


The address field or fields in a typical instruction format are relatively small. We 
would like to be able to reference a large range of locations in main memory or, for 
some systems, virtual memory. To achieve this objective, a variety of addressing 
techniques has been employed. They all involve some trade-off between address 
range and/or addressing flexibility, on the one hand, and the number of memory ref- 
erences in the instruction and/or the complexity of address calculation, on the other. 
In this section, we examine the most common addressing techniques: 


e Immediate 

e Direct 

e Indirect 

e Register 

e Register indirect 
e Displacement 

e Stack 
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Figure 11.1 Addressing Modes 


These modes are illustrated in Figure 11.1. In this section, we use the following notation: 


A = contents of an address field in the instruction 
R = contents of an address field in the instruction that refers to a register 
EA = actual (effective) address of the location containing the referenced operand 


cS 
| 


= contents of memory location X or register X 


Table 11. 
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1 Basic Addressing Modes 


Mode Algorithm Principal Advantage Principal Disadvantage 


Immediate Operand = A No memory reference Limited operand magnitude 


Direct 


Indirect 


IEA = AX Simple Limited address space 
EA = (A) Large address space Multiple memory references 


Register EA=R No memory reference Limited address space 


Register indirect EA = (R) Large address space Extra memory reference 


Displacement EA = A + (R) Flexibility Complexity 


Stack 


EA = top of stack No memory reference Limited applicability 


Table 11.1 indicates the address calculation performed for each addressing mode. 

Before beginning this discussion, two comments need to be made. First, virtu- 
ally all computer architectures provide more than one of these addressing modes. 
The question arises as to how the processor can determine which address mode is 
being used in a particular instruction. Several approaches are taken. Often, different 
opcodes will use different addressing modes. Also, one or more bits in the instruc- 
tion format can be used as a mode field. The value of the mode field determines 
which addressing mode is to be used. 

The second comment concerns the interpretation of the effective address 
(EA). In a system without virtual memory, the effective address will be either a 
main memory address or a register. In a virtual memory system, the effective 
address is a virtual address or a register. The actual mapping to a physical ad- 
dress is a function of the memory management unit (MMU) and is invisible to 
the programmer. 


Immediate Addressing 


The simplest form of addressing is immediate addressing, in which the operand 
value is present in the instruction 


Operand = A 


This mode can be used to define and use constants or set initial values of 
variables. Typically, the number will be stored in twos complement form; the left- 
most bit of the operand field is used as a sign bit. When the operand is loaded 
into a data register, the sign bit is extended to the left to the full data word size. 
In some cases, the immediate binary value is interpreted as an unsigned nonneg- 
ative integer. 

The advantage of immediate addressing is that no memory reference other 
than the instruction fetch is required to obtain the operand, thus saving one memory 
or cache cycle in the instruction cycle. The disadvantage is that the size of the num- 
ber is restricted to the size of the address field, which, in most instruction sets, is 
small compared with the word length. 
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Direct Addressing 


A very simple form of addressing is direct addressing, in which the address field con- 
tains the effective address of the operand: 


EA=A 


The technique was common in earlier generations of computers but is not common 
on contemporary architectures. It requires only one memory reference and no 
special calculation. The obvious limitation is that it provides only a limited address 
space. 


Indirect Addressing 


With direct addressing, the length of the address field is usually less than the word 
length, thus limiting the address range. One solution is to have the address field 
refer to the address of a word in memory, which in turn contains a full-length 
address of the operand. This is known as indirect addressing: 


EA = (A) 


As defined earlier, the parentheses are to be interpreted as meaning contents 
of The obvious advantage of this approach is that for a word length of N, an address 
space of 2" is now available. The disadvantage is that instruction execution requires 
two memory references to fetch the operand: one to get its address and a second to 
get its value. 

Although the number of words that can be addressed is now equal to 2%, the 
number of different effective addresses that may be referenced at any one time is 
limited to 2, where K is the length of the address field. Typically, this is not a bur- 
densome restriction, and it can be an asset. In a virtual memory environment, all the 
effective address locations can be confined to page 0 of any process. Because the 
address field of an instruction is small, it will naturally produce low-numbered direct 
addresses, which would appear in page 0. (The only restriction is that the page size 
must be greater than or equal to 2%.) When a process is active, there will be repeated 
references to page 0, causing it to remain in real memory. Thus, an indirect memory 
reference will involve, at most, one page fault rather than two. 

A rarely used variant of indirect addressing is multilevel or cascaded indirect 
addressing: 


EA = (...(A)...) 


In this case, one bit of a full-word address is an indirect flag (I). If the I bit is 0, then the 
word contains the EA. If the I bit is 1, then another level of indirection is invoked. There 
does not appear to be any particular advantage to this approach, and its disadvantage is 
that three or more memory references could be required to fetch an operand. 


Register Addressing 


Register addressing is similar to direct addressing. The only difference is that the 
address field refers to a register rather than a main memory address: 


EA=R 
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To clarify, if the contents of a register address field in an instruction is 5, then 
register R5 is the intended address, and the operand value is contained in RS. Typi- 
cally, an address field that references registers will have from 3 to 5 bits, so that a 
total of from 8 to 32 general-purpose registers can be referenced. 

The advantages of register addressing are that (1) only a small address field is 
needed in the instruction, and (2) no time-consuming memory references are required. 
As was discussed in Chapter 4, the memory access time for a register internal to the 
processor is much less than that for a main memory address. The disadvantage of regis- 
ter addressing is that the address space is very limited. 

If register addressing is heavily used in an instruction set, this implies that the 
processor registers will be heavily used. Because of the severely limited number of 
registers (compared with main memory locations), their use in this fashion makes 
sense only if they are employed efficiently. If every operand is brought into a regis- 
ter from main memory, operated on once, and then returned to main memory, then 
a wasteful intermediate step has been added. If, instead, the operand in a register 
remains in use for multiple operations, then a real savings is achieved. An example is 
the intermediate result in a calculation. In particular, suppose that the algorithm for 
twos complement multiplication were to be implemented in software. The location 
labeled A in the flowchart (Figure 9.12) is referenced many times and should be 
implemented in a register rather than a main memory location. 

It is up to the programmer or compiler to decide which values should remain 
in registers and which should be stored in main memory. Most modern processors 
employ multiple general-purpose registers, placing a burden for efficient execution 
on the assembly-language programmer (e.g., compiler writer). 


Register Indirect Addressing 


Just as register addressing is analogous to direct addressing, register indirect addressing 
is analogous to indirect addressing. In both cases, the only difference is whether the 
address field refers to a memory location or a register. Thus, for register indirect address, 


EA = (R) 


The advantages and limitations of register indirect addressing are basically the same 
as for indirect addressing. In both cases, the address space limitation (limited range 
of addresses) of the address field is overcome by having that field refer to a word- 
length location containing an address. In addition, register indirect addressing uses 
one less memory reference than indirect addressing. 


Displacement Addressing 


A very powerful mode of addressing combines the capabilities of direct addressing and 
register indirect addressing. It is known by a variety of names depending on the context 
of its use, but the basic mechanism is the same. We will refer to this as displacement 
addressing: 


EA = A +(R) 


Displacement addressing requires that the instruction have two address fields, at 
least one of which is explicit. The value contained in one address field (value = A) 
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is used directly. The other address field, or an implicit reference based on opcode, 
refers to a register whose contents are added to A to produce the effective 
address. 

We will describe three of the most common uses of displacement addressing: 


e Relative addressing 
e Base-register addressing 
e Indexing 


RELATIVE ADDRESSING For relative addressing, also called PC-relative addressing, 
the implicitly referenced register is the program counter (PC). That is, the next 
instruction address is added to the address field to produce the EA. Typically, the 
address field is treated as a twos complement number for this operation. Thus, the 
effective address is a displacement relative to the address of the instruction. 
Relative addressing exploits the concept of locality that was discussed in 
Chapters 4 and 8. If most memory references are relatively near to the instruction 
being executed, then the use of relative addressing saves address bits in the instruction. 


BASE-REGISTER ADDRESSING For base-register addressing, the interpretation is 
the following: The referenced register contains a main memory address, and the ad- 
dress field contains a displacement (usually an unsigned integer representation) 
from that address. The register reference may be explicit or implicit. 

Base-register addressing also exploits the locality of memory references. It is a 
convenient means of implementing segmentation, which was discussed in Chapter 8. 
In some implementations, a single segment-base register is employed and is used 
implicitly. In others, the programmer may choose a register to hold the base address 
of a segment, and the instruction must reference it explicitly. In this latter case, if the 
length of the address field is K and the number of possible registers is N, then one 
instruction can reference any one of N areas of 2“ words. 


INDEXING For indexing, the interpretation is typically the following: The address 
field references a main memory address, and the referenced register contains a pos- 
itive displacement from that address. Note that this usage is just the opposite of the 
interpretation for base-register addressing. Of course, it is more than just a matter of 
user interpretation. Because the address field is considered to be a memory address 
in indexing, it generally contains more bits than an address field in a comparable 
base-register instruction. Also, we shall see that there are some refinements to in- 
dexing that would not be as useful in the base-register context. Nevertheless, the 
method of calculating the EA is the same for both base-register addressing and in- 
dexing, and in both cases the register reference is sometimes explicit and sometimes 
implicit (for different processor types). 

An important use of indexing is to provide an efficient mechanism for per- 
forming iterative operations. Consider, for example, a list of numbers stored starting 
at location A. Suppose that we would like to add 1 to each element on the list. We 
need to fetch each value, add 1 to it, and store it back. The sequence of effective ad- 
dresses that we need is A,A + 1,A + 2,...,up to the last location on the list. With 
indexing, this is easily done. The value A is stored in the instruction’s address field, 
and the chosen register, called an index register, is initialized to 0. After each opera- 
tion, the index register is incremented by 1. 
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Because index registers are commonly used for such iterative tasks, it is typical 
that there is a need to increment or decrement the index register after each refer- 
ence to it. Because this is such a common operation, some systems will automatically 
do this as part of the same instruction cycle. This is known as autoindexing. If certain 
registers are devoted exclusively to indexing, then autoindexing can be invoked im- 
plicitly and automatically. If general-purpose registers are used, the autoindex oper- 
ation may need to be signaled by a bit in the instruction. Autoindexing using increment 
can be depicted as follows. 


EA = A +(R) 
(R)<—(R) + 1 


In some machines, both indirect addressing and indexing are provided, and it 
is possible to employ both in the same instruction. There are two possibilities: the 
indexing is performed either before or after the indirection. 

If indexing is performed after the indirection, it is termed postindexing: 


EA = (A) + (R) 


First, the contents of the address field are used to access a memory location contain- 
ing a direct address. This address is then indexed by the register value. This tech- 
nique is useful for accessing one of a number of blocks of data of a fixed format. For 
example, it was described in Chapter 8 that the operating system needs to employ a 
process control block for each process. The operations performed are the same 
regardless of which block is being manipulated. Thus, the addresses in the instruc- 
tions that reference the block could point to a location (value = A) containing a 
variable pointer to the start of a process control block. The index register contains 
the displacement within the block. 
With preindexing, the indexing is performed before the indirection: 


EA = (A + (R)) 


An address is calculated as with simple indexing. In this case, however, the calcu- 
lated address contains not the operand, but the address of the operand. An example 
of the use of this technique is to construct a multiway branch table. At a particular 
point in a program, there may be a branch to one of a number of locations depend- 
ing on conditions. A table of addresses can be set up starting at location A. By in- 
dexing into this table, the required location can be found. 

Typically, an instruction set will not include both preindexing and postin- 
dexing. 


Stack Addressing 


The final addressing mode that we consider is stack addressing. As defined in Ap- 
pendix 9A, a stack is a linear array of locations. It is sometimes referred to as a 
pushdown list or last-in-first-out queue. The stack is a reserved block of locations. 
Items are appended to the top of the stack so that, at any given time, the block is 
partially filled. Associated with the stack is a pointer whose value is the address of 
the top of the stack. Alternatively, the top two elements of the stack may be in 
processor registers, in which case the stack pointer references the third element of 
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the stack (Figure 10.14b). The stack pointer is maintained in a register. Thus, refer- 
ences to stack locations in memory are in fact register indirect addresses. 

The stack mode of addressing is a form of implied addressing. The machine 
instructions need not include a memory reference but implicitly operate on the top 
of the stack. 


11.2 x86 AND ARM ADDRESSING MODES 


x86 Addressing Modes 


Recall from Figure 8.21 that the x86 address translation mechanism produces an 
address, called a virtual or effective address, that is an offset into a segment. The sum 
of the starting address of the segment and the effective address produces a linear ad- 
dress. If paging is being used, this linear address must pass through a page-translation 
mechanism to produce a physical address. In what follows, we ignore this last step 
because it is transparent to the instruction set and to the programmer. 

The x86 is equipped with a variety of addressing modes intended to allow the 
efficient execution of high-level languages. Figure 11.2 indicates the logic involved. 
The segment register determines the segment that is the subject of the reference. 


Segment registers 


Base register 
SS Selector 
GS Selector (oe 
FS] Sel Index register 
ES Selector 
DS Selector 
—— cs Selector 


Scale 
1, 2, 4, or 8 
Displacement ONS 
‘ ‘ (in instruction; Segment 
Descriptor registers 0, 8, or 32 bits) base 
Aeeecae Effective plies 
_, GS address . 
Access rishts 5 Linear 
Access richts address 
B . ES 
BT] hnescstonia 5 = 
B Access rights ia 
5 Limit 
B imi 
Base Address 
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Figure 11.2 x86 Addressing Mode Calculation 
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There are six segment registers; the one being used for a particular reference 
depends on the context of execution and the instruction. Each segment register 
holds an index into the segment descriptor table (Figure 8.20), which holds the 
starting address of the corresponding segments. Associated with each user-visible 
segment register is a segment descriptor register (not programmer visible), which 
records the access rights for the segment as well as the starting address and limit 
(length) of the segment. In addition, there are two registers that may be used in 
constructing an address: the base register and the index register. 

Table 11.2 lists the x86 addressing modes. Let us consider each of these in turn. 

For the immediate mode, the operand is included in the instruction. The 
operand can be a byte, word, or doubleword of data. 

For register operand mode, the operand is located in a register. For general in- 
structions, such as data transfer, arithmetic, and logical instructions, the operand can 
be one of the 32-bit general registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP), 
one of the 16-bit general registers (AX, BX, CX, DX, SI, DI, SP, BP), or one of the 8- 
bit general registers (AH, BH, CH, DH, AL, BL, CL, DL). There are also some in- 
structions that reference the segment selector registers (CS, DS, ES, SS, FS, GS). 

The remaining addressing modes reference locations in memory. The memory 
location must be specified in terms of the segment containing the location and the off- 
set from the beginning of the segment. In some cases, a segment is specified explicitly; 
in others, the segment is specified by simple rules that assign a segment by default. 

In the displacement mode, the operand’s offset (the effective address of 
Figure 11.2) is contained as part of the instruction as an 8-, 16-, or 32-bit displace- 
ment. With segmentation, all addresses in instructions refer merely to an offset in a 


Table 11.2 x86 Addressing Modes 


Mode Algorithm 


Immediate Operand = A 
Register Operand LA=R 
Displacement LA = (SR) +A 
Base LA = (SR) + (B) 


Base with Displacement LA = (SR) + (B) 
Scaled Index with Displacement LA = (SR) + (I) X 
Base with Index and Displacement LA = (SR) + (B) 
Base with Scaled Index and Displacement LA = (SR) + (I) XS 
Relative LA = (PC) +A 


LA = linear address 
(X) = contents of X 
SR = segment register 
PC = program counter 


A = contents of an address field in the instruction 
R= = register 

B= base register 

I = index register 

S 


= scaling factor 
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segment. The displacement addressing mode is found on few machines because, as 
mentioned earlier, it leads to long instructions. In the case of the x86, the displace- 
ment value can be as long as 32 bits, making for a 6-byte instruction. Displacement 
addressing can be useful for referencing global variables. 

The remaining addressing modes are indirect, in the sense that the address por- 
tion of the instruction tells the processor where to look to find the address. The base 
mode specifies that one of the 8-, 16-, or 32-bit registers contains the effective 
address. This is equivalent to what we have referred to as register indirect addressing. 

In the base with displacement mode, the instruction includes a displacement to be 
added to a base register, which may be any of the general-purpose registers. Exam- 
ples of uses of this mode are as follows: 


e Used by a compiler to point to the start of a local variable area. For example, 
the base register could point to the beginning of a stack frame, which contains 
the local variables for the corresponding procedure. 


e Used to index into an array when the element size is not 1, 2, 4, or 8 bytes and 
which therefore cannot be indexed using an index register. In this case, the dis- 
placement points to the beginning of the array, and the base register holds the re- 
sults of a calculation to determine the offset to a specific element within the array. 


e Used to access a field of a record. The base register points to the beginning of 
the record, while the displacement is an offset to the field. 


In the scaled index with displacement mode, the instruction includes a displace- 
ment to be added to a register, in this case called an index register. The index regis- 
ter may be any of the general-purpose registers except the one called ESP, which is 
generally used for stack processing. In calculating the effective address, the contents 
of the index register are multiplied by a scaling factor of 1,2, 4, or 8, and then added 
to a displacement. This mode is very convenient for indexing arrays. A scaling factor 
of 2 can be used for an array of 16-bit integers. A scaling factor of 4 can be used for 
32-bit integers or floating-point numbers. Finally, a scaling factor of 8 can be used 
for an array of double-precision floating-point numbers. 

The base with index and displacement mode sums the contents of the base reg- 
ister, the index register, and a displacement to form the effective address. Again, the 
base register can be any general-purpose register and the index register can be any 
general-purpose register except ESP. As an example, this addressing mode could be 
used for accessing a local array on a stack frame. This mode can also be used to sup- 
port a two-dimensional array; in this case, the displacement points to the beginning 
of the array, and each register handles one dimension of the array. 

The based scaled index with displacement mode sums the contents of the index reg- 
ister multiplied by a scaling factor, the contents of the base register, and the displace- 
ment. This is useful if an array is stored in a stack frame; in this case, the array elements 
would be 2, 4, or 8 bytes each in length. This mode also provides efficient indexing of a 
two-dimensional array when the array elements are 2, 4, or 8 bytes in length. 

Finally, relative addressing can be used in transfer-of-control instructions. A 
displacement is added to the value of the program counter, which points to the next 
instruction. In this case, the displacement is treated as a signed byte, word, or dou- 
bleword value, and that value either increases or decreases the address in the pro- 
gram counter. 
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ARM Addressing Modes 


Typically, a RISC machine, unlike a CISC machine, uses a simple and relatively 
straightforward set of addressing modes. The ARM architecture departs somewhat 
from this tradition by providing a relatively rich set of addressing modes. These 
modes are most conveniently classified with respect to the type of instruction.' 


LOAD/STORE ADDRESSING Load and store instructions are the only instructions 
that reference memory. This is always done indirectly through a base register plus 
offset. There are three alternatives with respect to indexing (Figure 11.3): 


e Offset: For this addressing method, indexing is not used. An offset value is 
added to or subtracted from the value in the base register to form the memory 
address. As an example Figure 11.3a illustrates this method with the assembly 
language instruction STRB r0, [r1, #12]. This is the store byte instruction. 
In this case the base address is in register rl and the displacement is an imme- 
diate value of decimal 12. The resulting address (base plus offset) is the loca- 
tion where the least significant byte from 10 is to be stored. 


e Preindex: The memory address is formed in the same way as for offset ad- 
dressing. The memory address is also written back to the base register. In other 
words, the base register value is incremented or decremented by the offset 
value. Figure 11.3b illustrates this method with the assembly language instruc- 
tion STRB r0, [r1, #12]!.The exclamation point signifies preindexing. 


e Postindex: The memory address is the base register value. An offset is added to 
or subtracted from the base register value and the result is written back to the 
base register. Figure 11.3c illustrates this method with the assembly language 
instruction STRB rO, [rl], #12. 


Note that what ARM refers to as a base register acts as an index register for 
preindex and postindex addressing. The offset value can either be an immediate 
value stored in the instruction or it can be in another register. If the offset value is in 
a register, another useful feature is available: scaled register addressing. The value in 
the offset register is scaled by one of the shift operators: Logical Shift Left, Logical 
Shift Right, Arithmetic Shift Right, Rotate Right, or Rotate Right Extended (which 
includes the carry bit in the rotation). The amount of the shift is specified as an im- 
mediate value in the instruction. 


DATA PROCESSING INSTRUCTION ADDRESSING Data processing instructions use 
either register addressing of a mixture of register and immediate addressing. For 
register addressing, the value in one of the register operands may be scaled using 
one of the five shift operators defined in the preceding paragraph. 


BRANCH INSTRUCTIONS The only form of addressing for branch instructions is im- 
mediate addressing. The branch instruction contains a 24-bit value. For address cal- 
culation, this value is shifted left 2 bits, so that the address is on a word boundary. 
Thus the effective address range is +32 MB from the program counter. 


As with our discussion of x86 addressing, we ignore the translation from virtual to physical address in 
the following discussion. 
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STRB r0, [rl, #12] 
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Figure 11.3 ARM Indexing Methods 


LOAD/STORE MULTIPLE ADDRESSING Load Multiple instructions load a subset 
(possibly all) of the general-purpose registers from memory. Store Multiple in- 
structions store a subset (possibly all) of the general-purpose registers to memory. 
The list of registers for the load or store is specified in a 16-bit field in the instruc- 
tion with each bit corresponding to one of the 16 registers. Load and Store Multiple 
addressing modes produce a sequential range of memory addresses. The lowest- 
numbered register is stored at the lowest memory address and the highest- 
numbered register at the highest memory address. Four addressing modes are used 
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LDMxx r10, {r0, rl, r4} 
STMxx r10, {r0, rl, r4} 


Increment Increment Decrement Decrement 
110 after (IA) before (IB) after (DA) before (DB) 
Base register (x4) 0x218 
(v4) (v1) 0x214 
(rl) (x0) 0x210 
(x0) (v4) 0x20C 
((s21l)) (r4) 0x208 
(x0) (v1) 0x204 
(x0) 0x200 


Figure 11.4 ARM Load/Store Multiple Addressing 


(Figure 11.4): increment after, increment before, decrement after, and decrement 
before. A base register specifies a main memory address where register values are 
stored in or loaded from in ascending (increment) or descending (decrement) word 
locations. Incrementing or decrementing starts either before or after the first mem- 
ory access. 

These instructions are useful for block loads or stores, stack operations, and 
procedure exit sequences. 


11.3 INSTRUCTION FORMATS 


An instruction format defines the layout of the bits of an instruction, in terms of its 
constituent fields. An instruction format must include an opcode and, implicitly or 
explicitly, zero or more operands. Each explicit operand is referenced using one of 
the addressing modes described in Section 11.1. The format must, implicitly or 
explicitly, indicate the addressing mode for each operand. For most instruction sets, 
more than one instruction format is used. 

The design of an instruction format is a complex art, and an amazing variety of 
designs have been implemented. We examine the key design issues, looking briefly 
at some designs to illustrate points, and then we examine the x86 and ARM solu- 
tions in detail. 


Instruction Length 


The most basic design issue to be faced is the instruction format length. This deci- 
sion affects, and is affected by, memory size, memory organization, bus structure, 
processor complexity, and processor speed. This decision determines the richness 
and flexibility of the machine as seen by the assembly-language programmer. 

The most obvious trade-off here is between the desire for a powerful instruc- 
tion repertoire and a need to save space. Programmers want more opcodes, more 
operands, more addressing modes, and greater address range. More opcodes and 
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more operands make life easier for the programmer, because shorter programs can 
be written to accomplish given tasks. Similarly, more addressing modes give the pro- 
grammer greater flexibility in implementing certain functions, such as table manipu- 
lations and multiple-way branching. And, of course, with the increase in main 
memory size and the increasing use of virtual memory, programmers want to be able 
to address larger memory ranges. All of these things (opcodes, operands, addressing 
modes, address range) require bits and push in the direction of longer instruction 
lengths. But longer instruction length may be wasteful. A 64-bit instruction occupies 
twice the space of a 32-bit instruction but is probably less than twice as useful. 

Beyond this basic trade-off, there are other considerations. Either the instruc- 
tion length should be equal to the memory-transfer length (in a bus system, data-bus 
length) or one should be a multiple of the other. Otherwise, we will not get an inte- 
gral number of instructions during a fetch cycle. A related consideration is the mem- 
ory transfer rate. This rate has not kept up with increases in processor speed. 
Accordingly, memory can become a bottleneck if the processor can execute instruc- 
tions faster than it can fetch them. One solution to this problem is to use cache 
memory (see Section 4.3); another is to use shorter instructions. Thus, 16-bit instruc- 
tions can be fetched at twice the rate of 32-bit instructions but probably can be exe- 
cuted less than twice as rapidly. 

A seemingly mundane but nevertheless important feature is that the instruc- 
tion length should be a multiple of the character length, which is usually 8 bits, and 
of the length of fixed-point numbers. To see this, we need to make use of that un- 
fortunately ill-defined word, word [FRAI83]. The word length of memory is, in 
some sense, the “natural” unit of organization. The size of a word usually deter- 
mines the size of fixed-point numbers (usually the two are equal). Word size is also 
typically equal to, or at least integrally related to, the memory transfer size. Be- 
cause a common form of data is character data, we would like a word to store an 
integral number of characters. Otherwise, there are wasted bits in each word when 
storing multiple characters, or a character will have to straddle a word boundary. 
The importance of this point is such that IBM, when it introduced the System/360 
and wanted to employ 8-bit characters, made the wrenching decision to move 
from the 36-bit architecture of the scientific members of the 700/7000 series to a 
32-bit architecture. 


Allocation of Bits 


We’ve looked at some of the factors that go into deciding the length of the instruc- 
tion format. An equally difficult issue is how to allocate the bits in that format. The 
trade-offs here are complex. 

For a given instruction length, there is clearly a trade-off between the number 
of opcodes and the power of the addressing capability. More opcodes obviously 
mean more bits in the opcode field. For an instruction format of a given length, this 
reduces the number of bits available for addressing. There is one interesting refine- 
ment to this trade-off, and that is the use of variable-length opcodes. In this approach, 
there is a minimum opcode length but, for some opcodes, additional operations may 
be specified by using additional bits in the instruction. For a fixed-length instruction, 
this leaves fewer bits for addressing. Thus, this feature is used for those instructions 
that require fewer operands and/or less powerful addressing. 


bits. 
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The following interrelated factors go into determining the use of the addressing 


Number of addressing modes: Sometimes an addressing mode can be indi- 
cated implicitly. For example, certain opcodes might always call for indexing. 
In other cases, the addressing modes must be explicit, and one or more mode 
bits will be needed. 


Number of operands: We have seen that fewer addresses can make for longer, 
more awkward programs (e.g., Figure 10.3). Typical instructions on today’s 
machines provide for two operands. Each operand address in the instruction 
might require its own mode indicator, or the use of a mode indicator could be 
limited to just one of the address fields. 


Register versus memory: A machine must have registers so that data can be 
brought into the processor for processing. With a single user-visible register 
(usually called the accumulator), one operand address is implicit and con- 
sumes no instruction bits. However, single-register programming is awkward 
and requires many instructions. Even with multiple registers, only a few bits 
are needed to specify the register. The more that registers can be used for 
operand references, the fewer bits are needed. A number of studies indicate 
that a total of 8 to 32 user-visible registers is desirable [LUND77, HUCK83]. 
Most contemporary architectures have at least 32 registers. 


Number of register sets: Most contemporary machines have one set of general- 
purpose registers, with typically 32 or more registers in the set. These registers 
can be used to store data and can be used to store addresses for displacement 
addressing. Some architectures, including that of the x86, have a collection of 
two or more specialized sets (such as data and displacement). One advantage of 
this latter approach is that, for a fixed number of registers, a functional split re- 
quires fewer bits to be used in the instruction. For example, with two sets of 
eight registers, only 3 bits are required to identify a register; the opcode or 
mode register will determine which set of registers is being referenced. 


Address range: For addresses that reference memory, the range of addresses 
that can be referenced is related to the number of address bits. Because this im- 
poses a severe limitation, direct addressing is rarely used. With displacement 
addressing, the range is opened up to the length of the address register. Even 
So, it is still convenient to allow rather large displacements from the register ad- 
dress, which requires a relatively large number of address bits in the instruction. 


Address granularity: For addresses that reference memory rather than regis- 
ters, another factor is the granularity of addressing. In a system with 16- or 32-bit 
words, an address can reference a word or a byte at the designer’s choice. Byte 
addressing is convenient for character manipulation but requires, for a fixed- 
size memory, more address bits. 


Thus, the designer is faced with a host of factors to consider and balance. How 


critical the various choices are is not clear. As an example, we cite one study [CRAG79] 
that compared various instruction format approaches, including the use of a stack, 
general-purpose registers, an accumulator, and only memory-to-register approaches. 
Using a consistent set of assumptions, no significant difference in code space or exe- 
cution time was observed. 
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Let us briefly look at how two historical machine designs balance these vari- 
ous factors. 


PDP-& One of the simplest instruction designs for a general-purpose computer 
was for the PDP-8 [BELL78b]. The PDP-8 uses 12-bit instructions and operates on 
12-bit words. There is a single general-purpose register, the accumulator. 

Despite the limitations of this design, the addressing is quite flexible. Each 
memory reference consists of 7 bits plus two 1-bit modifiers. The memory is divided 
into fixed-length pages of 27 = 128 words each. Address calculation is based on ref- 
erences to page 0 or the current page (page containing this instruction) as deter- 
mined by the page bit. The second modifier bit indicates whether direct or indirect 
addressing is to be used. These two modes can be used in combination, so that an 
indirect address is a 12-bit address contained in a word of page 0 or the current 
page. In addition, 8 dedicated words on page 0 are autoindex “registers.” When an 
indirect reference is made to one of these locations, preindexing occurs. 

Figure 11.5 shows the PDP-8 instruction format. There are a 3-bit opcode and 
three types of instructions. For opcodes 0 through 5, the format is a single-address 
memory reference instruction including a page bit and an indirect bit. Thus, there 
are only six basic operations. To enlarge the group of operations, opcode 7 defines a 
register reference or microinstruction. In this format, the remaining bits are used to 
encode additional operations. In general, each bit defines a specific operation (e.g., 
clear accumulator), and these bits can be combined in a single instruction. The mi- 
croinstruction strategy was used as far back as the PDP-1 by DEC and is, in a sense, 


Memory reference instructions 


Opcode DA | ZC Displacement 
0 2 3 4+ 5 11 
Input/output instructions 
1 1 0 Device Opcode 
0 2 3 8 9 11 


oer: . Register reference instructions 
Group | microinstructions 


1 1 1 0 CLA | CLL | CMA | CML |} RAR | RAL | BSW | IAC 
0 1 2 3 4 5 6 7 8 9 10 11 
Group 2 microinstructions 
1 1 1 0 CLA | SMA | SZA | SNL | RSS | OSR | HLT 0 
0 1 2 3 4+ 5 6 7 8 9 10 11 
Group 3 microinstructions 
1 1 1 0 CLA | MQA 0 MQL 0 0 0 if 
0 1 2 3 4+ 2) 6 7 8 9 10 11 

D/l_ = Direct/Indirect address IAC = Increment ACcumulator 

Z/C = Page 0 or Current page SMA = Skip on Minus Accumulator 

CLA = Clear Accumulator SZA = Skip on Zero Accumulator 

CLL = Clear Link SNL = Skip on Nonzero Link 

CMA = CoMplement Accumulator RSS _ = Reverse Skip Sense 

CML = CoMplement Link OSR = Or with Switch Register 

RAR = Rotate Accumulator Right HLT = HaLT 

RAL = Rotate Accumulator Left MQA = Multiplier Quotient into Accumulator 

BSW = Byte SWap MQL = Multiplier Quotient Load 


Figure 11.5 PDP-8 Instruction Formats 
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a forerunner of today’s microprogrammed machines, to be discussed in Part Four. 
Opcode 6 is the I/O operation; 6 bits are used to select one of 64 devices, and 3 bits 
specify a particular I/O command. 

The PDP-8 instruction format is remarkably efficient. It supports indirect 
addressing, displacement addressing, and indexing. With the use of the opcode ex- 
tension, it supports a total of approximately 35 instructions. Given the constraints of 
a 12-bit instruction length, the designers could hardly have done better. 


PDP-10 A sharp contrast to the instruction set of the PDP-8 is that of the PDP-10. 
The PDP-10 was designed to be a large-scale time-shared system, with an emphasis on 
making the system easy to program, even if additional hardware expense was involved. 

Among the design principles employed in designing the instruction set were 
the following [BELL78c]: 


e Orthogonality: Orthogonality is a principle by which two variables are inde- 
pendent of each other. In the context of an instruction set, the term indicates 
that other elements of an instruction are independent of (not determined by) 
the opcode. The PDP-10 designers use the term to describe the fact that an ad- 
dress is always computed in the same way, independent of the opcode. This is 
in contrast to many machines, where the address mode sometimes depends im- 
plicitly on the operator being used. 


e Completeness: Each arithmetic data type (integer, fixed-point, floating-point) 
should have a complete and identical set of operations. 


e Direct addressing: Base plus displacement addressing, which places a memory or- 
ganization burden on the programmer, was avoided in favor of direct addressing. 


Each of these principles advances the main goal of ease of programming. 

The PDP-10 has a 36-bit word length and a 36-bit instruction length. The fixed 
instruction format is shown in Figure 11.6. The opcode occupies 9 bits, allowing up to 
512 operations. In fact, a total of 365 different instructions are defined. Most instruc- 
tions have two addresses, one of which is one of 16 general-purpose registers. Thus, 
this operand reference occupies 4 bits. The other operand reference starts with an 
18-bit memory address field. This can be used as an immediate operand or a mem- 
ory address. In the latter usage, both indexing and indirect addressing are allowed. 
The same general-purpose registers are also used as index registers. 

A 36-bit instruction length is true luxury. There is no need to do clever things 
to get more opcodes; a 9-bit opcode field is more than adequate. Addressing is also 
straightforward. An 18-bit address field makes direct addressing desirable. For 
memory sizes greater than 2'®, indirection is provided. For the ease of the programmer, 
indexing is provided for table manipulation and iterative programs. Also, with an 
18-bit operand field, immediate addressing becomes attractive. 


; Index 
Opcode Register | I register Memory address 
0 8 9 12 14 17 18 35 


I = indirect bit 


Figure 11.6 PDP-10 Instruction Format 
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The PDP-10 instruction set design does accomplish the objectives listed earlier 
[LUND77]. It eases the task of the programmer or compiler at the expense of an in- 
efficient utilization of space. This was a conscious choice made by the designers and 
therefore cannot be faulted as poor design. 


Variable-Length Instructions 


The examples we have looked at so far have used a single fixed instruction length, 
and we have implicitly discussed trade-offs in that context. But the designer may 
choose instead to provide a variety of instruction formats of different lengths. This 
tactic makes it easy to provide a large repertoire of opcodes, with different opcode 
lengths. Addressing can be more flexible, with various combinations of register and 
memory references plus addressing modes. With variable-length instructions, these 
many variations can be provided efficiently and compactly. 

The principal price to pay for variable-length instructions is an increase in the 
complexity of the processor. Falling hardware prices, the use of microprogramming 
(discussed in Part Four), and a general increase in understanding the principles of 
processor design have all contributed to making this a small price to pay. However, 
we will see that RISC and superscalar machines can exploit the use of fixed-length 
instructions to provide improved performance. 

The use of variable-length instructions does not remove the desirability of 
making all of the instruction lengths integrally related to the word length. Because 
the processor does not know the length of the next instruction to be fetched, a typi- 
cal strategy is to fetch a number of bytes or words equal to at least the longest pos- 
sible instruction. This means that sometimes multiple instructions are fetched. 
However, as we shall see in Chapter 12, this is a good strategy to follow in any case. 


PDP-11 The PDP-11 was designed to provide a powerful and flexible instruction 
set within the constraints of a 16-bit minicomputer [BELL70]. 

The PDP-11 employs a set of eight 16-bit general-purpose registers. Two of 
these registers have additional significance: one is used as a stack pointer for special- 
purpose stack operations, and one is used as the program counter, which contains 
the address of the next instruction. 

Figure 11.7 shows the PDP-11 instruction formats. Thirteen different formats 
are used, encompassing zero-, one-, and two-address instruction types. The opcode 
can vary from 4 to 16 bits in length. Register references are 6 bits in length. Three 
bits identify the register, and the remaining 3 bits identify the addressing mode. The 
PDP-11 is endowed with a rich set of addressing modes. One advantage of linking 
the addressing mode to the operand rather than the opcode, as is sometimes done, is 
that any addressing mode can be used with any opcode. As was mentioned, this 
independence is referred to as orthogonality. 

PDP-11 instructions are usually one word (16 bits) long. For some instructions, 
one or two memory addresses are appended, so that 32-bit and 48-bit instructions 
are part of the repertoire. This provides for further flexibility in addressing. 

The PDP-11 instruction set and addressing capability are complex. This 
increases both hardware cost and programming complexity. The advantage is that 
more efficient or compact programs can be developed. 


6Tr 


= 


10 


11 


12 


13 


Opcode} Source {| Destination} 2 Opcode R Source 3 Opcode Offet 
4 6 6 7 3 6 8 8 
Opcode FP | Destination| 5 Opcode Destination| 6 Opcode CC 
8 2 6 10 6 12 4 
Opcode | R | 8 Opcode Opcode| Source | Destination Memory address 
13 3 16 4 6 6 16 
Opcode R | Source | Memory address 
7 3 6 16 
Opcode re| Source Memory address 
8 2 6 16 
Opcode Destination Memory address 
10 6 16 
Opcode| Source | Destination Memory address 1 Memory address 2 
4 6 6 16 16 


Numbers below fields indicate bit length 


Source and destination each contain a 3-bit addressing mode field and a 3-bit register number 
FP indicates one of four floating-point registers 
R indicates one of the general-purpose registers 
CC is the condition code field 


Figure 11.7 Instruction Formats for the PDP-11 
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VAX Most architectures provide a relatively small number of fixed instruction for- 
mats. This can cause two problems for the programmer. First, addressing mode and 
opcode are not orthogonal. For example, for a given operation, one operand must 
come from a register and another from memory, or both from registers, and so on. 
Second, only a limited number of operands can be accommodated: typically up to 
two or three. Because some operations inherently require more operands, various 
strategies must be used to achieve the desired result using two or more instructions. 

To avoid these problems, two criteria were used in designing the VAX instruc- 
tion format [STRE78]: 


1. All instructions should have the “natural” number of operands. 

2. All operands should have the same generality in specification. 
The result is a highly variable instruction format. An instruction consists of a 1- or 
2-byte opcode followed by from zero to six operand specifiers, depending on the op- 


code. The minimal instruction length is 1 byte, and instructions up to 37 bytes can be 
constructed. Figure 11.8 gives a few examples. 


Hexadecimal Explanation Assembler Notation 
Format and Description 
8 bits 
2 —— 
0 5 Opcode for RSB RSB 
Return from subroutine 
D/!4 Opcode for CLRL CLRL R9 
5 9 Register R9 Clear register R9 
B 0 Opcode for MOVW MOVW 356(R4), 25(R11) 
cc | a Word displacement mode, ove a word from address 
614 Register R4 that is 356 plus contents 
t 356 in hexadecimal of R4 to address that is 
0 1 F 25 plus contents of R11 
AIlB Byte displacement mode, 
Register R11 
1 9 25 in hexadecimal 
pcode for » RO, 
Opcode for ADDL3 ADDL3 #5, RO, @A[R2] 
ort litera to a 32-bit integer in 
Short literal 5 Add 5 32-bit integer i 
5 0 Register mode RO RO and store the result in 
Ind fix R2 location whose address is 
4 2 i ee prea d relati sum of A and 4 times the 
D F ndirect word relative contents of R2 
(displacement from PC) : 
+ i Amount of displacement from 
PC relative to location A 


Figure 11.8 Example of VAX Instructions 
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The VAX instruction begins with a 1-byte opcode. This suffices to handle most 
VAX instructions. However, as there are over 300 different instructions, 8 bits are 
not enough. The hexadecimal codes FD and FF indicate an extended opcode, with 
the actual opcode being specified in the second byte. 

The remainder of the instruction consists of up to six operand specifiers. An 
operand specifier is, at minimum, a 1-byte format in which the leftmost 4 bits are 
the address mode specifier. The only exception to this rule is the literal mode, 
which is signaled by the pattern 00 in the leftmost 2 bits, leaving space for a 6-bit 
literal. Because of this exception, a total of 12 different addressing modes can 
be specified. 

An operand specifier often consists of just one byte, with the rightmost 4 bits 
specifying one of 16 general-purpose registers. The length of the operand specifier 
can be extended in one of two ways. First, a constant value of one or more bytes may 
immediately follow the first byte of the operand specifier. An example of this is the 
displacement mode, in which an 8-, 16-, or 32-bit displacement is used. Second, an 
index mode of addressing may be used. In this case, the first byte of the operand 
specifier consists of the 4-bit addressing mode code of 0100 and a 4-bit index regis- 
ter identifier. The remainder of the operand specifier consists of the base address 
specifier, which may itself be one or more bytes in length. 

The reader may be wondering, as the author did, what kind of instruction requires 
six operands. Surprisingly, the VAX has a number of such instructions. Consider 


ADDP6 OP1, OP2, OP3, OP4, OPS, OP6 


This instruction adds two packed decimal numbers. OP1 and OP2 specify the length 
and starting address of one decimal string; OP3 and OP4 specify a second string. 
These two strings are added and the result is stored in the decimal string whose 
length and starting location are specified by OP5 and OP6. 

The VAX instruction set provides for a wide variety of operations and ad- 
dressing modes. This gives a programmer, such as a compiler writer, a very powerful 
and flexible tool for developing programs. In theory, this should lead to efficient 
machine-language compilations of high-level language programs and, in general, to 
effective and efficient use of processor resources. The penalty to be paid for these 
benefits is the increased complexity of the processor compared with a processor 
with a simpler instruction set and format. 

We return to these matters in Chapter 13, where we examine the case for very 
simple instruction sets. 


11.4 x86 AND ARM INSTRUCTION FORMATS 


x86 Instruction Formats 


The x86 is equipped with a variety of instruction formats. Of the elements described 
in this subsection, only the opcode field is always present. Figure 11.9 illustrates the 
general instruction format. Instructions are made up of from zero to four optional 
instruction prefixes, a 1- or 2-byte opcode, an optional address specifier (which 
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Oorl 


Oorl Oorl Oorl bytes 


prefix 


override 


Address 
size 
override 


Operand 
size 
override 


a. 

0, 1, 2, 3, or 4 bytes ge 1, 2, or 3 Oorl Oorl 0,1, 2, or 4 0,1, 2, or 4 
Instruction prefixes Opcode ModR/m SIB Displacement Immediate 
eae \ oo SL 
germ \\ eae 

a7 Vy SSa 
Pei oe I oy Ta, 
a 1 a 
ae ae TH 
oe I \ ~e 
a i \ ha 
Mod Reg/Opcode RM Scale Index Base 
7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 


Figure 11.9 x86 Instruction Format 


consists of the ModR/m byte and the Scale Index byte) an optional displacement, 
and an optional immediate field. 


Let us first consider the prefix bytes: 


Instruction prefixes: The instruction prefix, if present, consists of the LOCK 
prefix or one of the repeat prefixes. The LOCK prefix is used to ensure ex- 
clusive use of shared memory in multiprocessor environments. The repeat 
prefixes specify repeated operation of a string, which enables the x86 to 
process strings much faster than with a regular software loop. There are five 
different repeat prefixes: REP, REPE, REPZ, REPNE, and REPNZ. When 
the absolute REP prefix is present, the operation specified in the instruction 
is executed repeatedly on successive elements of the string; the number of 
repetitions is specified in register CX. The conditional REP prefix causes the 
instruction to repeat until the count in CX goes to zero or until the condition 
is met. 


Segment override: Explicitly specifies which segment register an instruction 
should use, overriding the default segment-register selection generated by the 
x86 for that instruction. 


Operand size: An instruction has a default operand size of 16 or 32 bits, and 
the operand prefix switches between 32-bit and 16-bit operands. 


Address size: The processor can address memory using either 16- or 32-bit ad- 
dresses. The address size determines the displacement size in instructions and 
the size of address offsets generated during effective address calculation. One 
of these sizes is designated as default, and the address size prefix switches 
between 32-bit and 16-bit address generation. 
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The instruction itself includes the following fields: 


e Opcode: The opcode field is 1, 2, or 3 bytes in length. The opcode may also 
include bits that specify if data is byte- or full-size (16 or 32 bits depending on 
context), direction of data operation (to or from memory), and whether an im- 
mediate data field must be sign extended. 


e ModR/m: This byte, and the next, provide addressing information. The ModR/m 
byte specifies whether an operand is in a register or in memory; if it is in mem- 
ory, then fields within the byte specify the addressing mode to be used. The 
ModR/m byte consists of three fields: The Mod field (2 bits) combines with 
the r/m field to form 32 possible values: 8 registers and 24 indexing modes; the 
Reg/Opcode field (3 bits) specifies either a register number or three more bits 
of opcode information; the r/m field (3 bits) can specify a register as the loca- 
tion of an operand, or it can form part of the addressing-mode encoding in 
combination with the Mod field. 


e SIB: Certain encoding of the ModR/m byte specifies the inclusion of the SIB 
byte to specify fully the addressing mode. The SIB byte consists of three fields: 
The Scale field (2 bits) specifies the scale factor for scaled indexing; the Index 
field (3 bits) specifies the index register; the Base field (3 bits) specifies the 
base register. 


e Displacement: When the addressing-mode specifier indicates that a dis- 
placement is used, an 8-, 16-, or 32-bit signed integer displacement field is 
added. 


e Immediate: Provides the value of an 8-, 16-, or 32-bit operand. 


Several comparisons may be useful here. In the x86 format, the addressing 
mode is provided as part of the opcode sequence rather than with each operand. 
Because only one operand can have address-mode information, only one mem- 
ory operand can be referenced in an instruction. In contrast, the VAX carries the 
address-mode information with each operand, allowing memory-to-memory 
operations. The x86 instructions are therefore more compact. However, if a 
memory-to-memory operation is required, the VAX can accomplish this in a 
single instruction. 

The x86 format allows the use of not only 1-byte, but also 2-byte and 4-byte 
offsets for indexing. Although the use of the larger index offsets results in longer 
instructions, this feature provides needed flexibility. For example, it is useful in 
addressing large arrays or large stack frames. In contrast, the IBM S/370 instruc- 
tion format allows offsets no greater than 4K bytes (12 bits of offset informa- 
tion), and the offset must be positive. When a location is not in reach of this 
offset, the compiler must generate extra code to generate the needed address. 
This problem is especially apparent in dealing with stack frames that have local 
variables occupying in excess of 4K bytes. As [DEWA90] puts it, “generating 
code for the 370 is so painful as a result of that restriction that there have even 
been compilers for the 370 that simply chose to limit the size of the stack frame to 
4K bytes.” 
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31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14131211109 8 76543210 


Data processing 


immediate shift | CO [0 0 0} Opcode |S Rn Rd Shift amount | Shift] 0 Rm 
ieee via: Cond |0 0 0} Opcode |S] Rn Rd Rs |0/Shift}1] Rm 
pee anes Cond [0 0 1] Opcode |S Rn Rd Rotate Immediate 
sed nor an [o 1 ofpjufefw[L] Rn Rd Immediate 
cepiner otter MRE Jot 1fP]upB wit] Rn Rae) Shift amount | shift] 0 |TnRm 
gerd Cond E 0 O;}PJU;/S|WIL Rn Register list 
Branchrane? ae [2 o aL 24-bit offset 


S = For data processing instructions, signifies that the instruction 
updates the condition codes 

S = For load/store multiple instructions, signifies whether instruction 
execution is restricted to supervisor mode 

P, U, W = Bits that distinguish between 
different types of addressing_mode 

B = Distinguishes between an unsigned 
byte (B==1) and a word (B==0) access 

L = For load/store instructions, distinguishes 
between a Load (L==1) and a Store (L==0) 

L = For branch instructions, determines whether a 
return address is stored in the link register 


Figure 11.10 ARM Instruction Formats 


As can be seen, the encoding of the x86 instruction set is very complex. This 
has to do partly with the need to be backward compatible with the 8086 machine 
and partly with a desire on the part of the designers to provide every possible assis- 
tance to the compiler writer in producing efficient code. It is a matter of some 
debate whether an instruction set as complex as this is preferable to the opposite 
extreme of the RISC instruction sets. 


ARM Instruction Formats 


All instructions in the ARM architecture are 32 bits long and follow a regular for- 
mat (Figure 11.10). The first four bits of an instruction are the condition code. As 
discussed in Chapter 10, virtually all ARM instructions can be conditionally exe- 
cuted. The next three bits specify the general type of instruction. For most instruc- 
tions other than branch instructions, the next five bits constitute an opcode and/or 
modifier bits for the operation. The remaining 20 bits are for operand addressing. 
The regular structure of the instruction formats eases the job of the instruction 
decode units. 


IMMEDIATE CONSTANTS To achieve a greater range of immediate values, the data 
processing immediate format specifies both an immediate value and a rotate value. 
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31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14131211109 8 76543210 
0}0}0}0)0)0)0/0/0/0/}0}0}0}0}0/0}/0}/0/0}0/0}0}0]0 


ror #0—range 0 through 0x000000FF—step 0x00000001 


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14131211109 8 76543210 
0/0}0}0)0)0)0/0/0/0}0}0}0]0}0/0}/0}/0/0/0/}0}0}0]0 


ror #8—range 0 through 0xFF000000—step 0x01000000 


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14131211109 8 76543210 
0}0}0}0)0}0)0/0/0/0}0}0}0}0}0/0}0/0/0/0/0}0 0}0 


ror #30—range 0 through 0x000003FC—step 0x00000004 
Figure 11.11 Examples of Use of ARM Immediate Contants 


The 8-bit immediate value is expanded to 32 bits and then rotated right by a num- 
ber of bits equal to twice the 4-bit rotate value. Several examples are shown in 
Figure 11.11. 


THUMB INSTRUCTION SET The Thumb instruction set is a re-encoded subset of the 
ARM instruction set. Thumb is designed to increase the performance of ARM im- 
plementations that use a 16-bit or narrower memory data bus and to allow better 
code density than provided by the ARM instruction set. The Thumb instruction set 
contains a subset of the ARM 32-bit instruction set recoded into 16-bit instructions. 
The savings is achieved in the following way: 


1. Thumb instructions are unconditional, so the condition code field is not used. 
Also, all Thumb arithmetic and logic instructions update the condition flags, so 
that the update-flag bit is not needed. Savings: 5 bits. 


2. Thumb has only a subset of the operations in the full instruction set and uses only 
a 2-bit opcode field, plus a 3-bit type field. Savings: 2 bits. 


3. The remaining savings of 9 bits comes from reductions in the operand specifi- 
cations. For example, Thumb instructions reference only registers rO through 
r7, so only 3 bits are required for register references, rather than 4 bits. Imme- 
diate values do not include a 4-bit rotate field. 


The ARM processor can execute a program consisting of a mixture of Thumb 
instructions and 32-bit ARM instructions. A bit in the processor control register 
determines which type of instruction is currently being executed. Figure 11.12 shows 
an example. The figure shows both the general format and a specific instance of an 
instruction in both 16-bit and 32-bit formats. 
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15 14131211109 876543210 


Add/subract/compare/move ® @ i 
immediate format 


Opcode] Rd/Rn Immediate 


ADD r3, #19 |0 0 1} 1 0/0 1 1/00010011 


Always Update 
condition condition Zero 
code flags rotation 


ADDS r3, r3, #19/1 1 10/0 01/001 0/1/0011/0011/0000j00010011 


Data processing 
immediate format 


Cond 0 0 1] Opcode |S Rn Rd Rotate Immediate 


31 30 29 28 27 26 25 24 23 22 21 20 19 1817 1615 14131211109 8 76543210 
Figure 11.12 Expanding a Thumb ADD Instruction into its ARM Equivalent 


11.5 ASSEMBLY LANGUAGE 


A processor can understand and execute machine instructions. Such instructions are 
simply binary numbers stored in the computer. If a programmer wished to program 
directly in machine language, then it would be necessary to enter the program as 
binary data. 

Consider the simple BASIC statement 


N=I1I+J+K 


Suppose we wished to program this statement in machine language and to initialize 
I, J, and K to 2, 3, and 4, respectively. This is shown in Figure 11.13a. The program 
starts in location 101 (hexadecimal). Memory is reserved for the four variables start- 
ing at location 201. The program consists of four instructions: 


1. Load the contents of location 201 into the AC. 
2. Add the contents of location 202 to the AC. 

3. Add the contents of location 203 to the AC. 

4. Store the contents of the AC in location 204. 


This is clearly a tedious and very error-prone process. 

A slight improvement is to write the program in hexadecimal rather than binary 
notation (Figure 10.11b). We could write the program as a series of lines. Each line 
contains the address of a memory location and the hexadecimal code of the binary 
value to be stored in that location. Then we need a program that will accept this input, 
translate each line into a binary number, and store it in the specified location. 

For more improvement, we can make use of the symbolic name or mnemonic 
of each instruction. This results in the symbolic program shown in Figure 10.11c. 
Each line of input still represents one memory location. Each line consists of three 
fields, separated by spaces. The first field contains the address of a location. For an 


11.5 / ASSEMBLY LANGUAGE 427 
Address Contents Address Contents 
101 0010 0010 101 2201 101 2201 
102 0001 0010 102 1202 102 1202 
103 0001 0010 103 1203 103 1203 
104 0011 0010 104 3204 104 3204 
201 0000 0000 201 0002 201 0002 
202 0000 0000 202 0003 202 0003 
203 0000 0000 203 0004 203 0004 
204 0000 0000 204 0000 204 0000 
(a) Binary program (b) Hexadecimal program 
Address Instruction Label Operation Operand 
101 LDA 201 FORMUL LDA I 
102 ADD 202 ADD J 
103 ADD 203 ADD K 
104 STA 204 STA N 
201 DAT 2; I DATA 2 
202 DAT 3 J DATA 3 
203 DAT 4 K DATA 4 
204 DAT 0 N DATA 0 
(c) Symbolic program (d) Assembly program 


Figure 11.13 Computation of the FormulaN =I+J+K 


instruction, the second field contains the three-letter symbol for the opcode. If it is a 
memory-referencing instruction, then a third field contains the address. To store 
arbitrary data in a location, we invent a pseudoinstruction with the symbol DAT. 
This is merely an indication that the third field on the line contains a hexadecimal 
number to be stored in the location specified in the first field. 

For this type of input we need a slightly more complex program. The program 
accepts each line of input, generates a binary number based on the second and third 
(if present) fields, and stores it in the location specified by the first field. 

The use of a symbolic program makes life much easier but is still awkward. 
In particular, we must give an absolute address for each word. This means that the 
program and data can be loaded into only one place in memory, and we must 
know that place ahead of time. Worse, suppose we wish to change the program 
some day by adding or deleting a line. This will change the addresses of all subse- 
quent words. 

A much better system, and one commonly used, is to use symbolic addresses. 
This is illustrated in Figure 10.11d. Each line still consists of three fields. The first 
field is still for the address, but a symbol is used instead of an absolute numerical ad- 
dress. Some lines have no address, implying that the address of that line is one more 
than the address of the previous line. For memory-reference instructions, the third 
field also contains a symbolic address. 

With this last refinement, we have an assembly language. Programs written in 
assembly language (assembly programs) are translated into machine language by an 
assembler. This program must not only do the symbolic translation discussed earlier 
but also assign some form of memory addresses to symbolic addresses. 
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The development of assembly language was a major milestone in the evolu- 
tion of computer technology. It was the first step to the high-level languages in use 
today. Although few programmers use assembly language, virtually all machines 
provide one. They are used, if at all, for systems programs such as compilers and 
1/O routines. 

Appendix B provides a more detailed examination of assembly language. 


11.6 RECOMMENDED READING 


The references cited in Chapter 10 are equally applicable to the material of this chapter. 
[BLAA97] contains a detailed discussion of instruction formats and addressing modes. In ad- 
dition, the reader may wish to consult [FLYN85] for a discussion and analysis of instruction 
set design issues, particularly those relating to formats. 


BLAA97 Blaauw, G., and Brooks, F. Computer Architecture: Concepts and Evolution. 
Reading, MA: Addison-Wesley, 1997. 


FLYN85 Flynn, M.; Johnson, J.; and Wakefield, S. “On Instruction Sets and Their 
Formats.” [EEE Transactions on Computers, March 1985. 


11.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


autoindexing immediate addressing preindexing 
base-register addressing indexing register addressing 


direct addressing indirect addressing register indirect addressing 
displacement addressing instruction format relative addressing 
effective address postindexing word 


Review Questions 


11.1 Briefly define immediate addressing. 
11.2 Briefly define direct addressing. 
11.3 Briefly define indirect addressing. 
11.4 Briefly define register addressing. 
11.5 Briefly define register indirect addressing. 
11.6 Briefly define displacement addressing. 
11.7 Briefly define relative addressing. 
11.8 What is the advantage of autoindexing? 
11.9 What is the difference between postindexing and preindexing? 
11.10 What facts go into determining the use of the addressing bits of an instruction? 


11.11 What are the advantages and disadvantages of using a variable-length instruction 
format? 
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Problems 


11.1 


11.2 


11.4 


11.6 


Given the following memory values and a one-address machine with an accumulator, 
what values do the following instructions load into the accumulator? 
e Word 20 contains 40. 
e Word 30 contains 50. 
e Word 40 contains 60. 
e Word 50 contains 70. 
a. LOAD IMMEDIATE 20 
b. LOAD DIRECT 20 
ce. LOAD INDIRECT 20 
d. LOAD IMMEDIATE 30 
e. LOAD DIRECT 30 
f. LOAD INDIRECT 30 


Let the address stored in the program counter be designated by the symbol X1. The 
instruction stored in X1 has an address part (operand reference) X2. The operand 
needed to execute the instruction is stored in the memory word with address X3. An 
index register contains the value X4. What is the relationship between these various 
quantities if the addressing mode of the instruction is (a) direct; (b) indirect; (c) PC 
relative; (d) indexed? 

An address field in an instruction contains decimal value 14. Where is the corre- 
sponding operand located for 

a. immediate addressing? 

b. direct addressing? 

c. indirect addressing? 

d. register addressing? 

e. register indirect addressing? 


Consider a 16-bit processor in which the following appears in main memory, starting 
at location 200: 


200 Load to AC Mode 


201 500 


202 Next instruction 


The first part of the first word indicates that this instruction loads a value into an 
accumulator. The Mode field specifies an addressing mode and, if appropriate, indi- 
cates a source register; assume that when used, the source register is R1, which has a 
value of 400. There is also a base register that contains the value 100. The value of 500 
in location 201 may be part of the address calculation. Assume that location 399 con- 
tains the value 999, location 400 contains the value 1000, and so on. Determine the 
effective address and the operand to be loaded for the following address modes: 


a. Direct d. PC relative g. Register indirect 
b. Immediate | e. Displacement | h. Autoindexing with increment, using R1 
c. Indirect f. Register 


A PC-relative mode branch instruction is 3 bytes long. The address of the instruction, 
in decimal, is 256028. Determine the branch target address if the signed displacement 
in the instruction is —31. 

A PC-relative mode branch instruction is stored in memory at address 620,9. The 
branch is made to location 5309. The address field in the instruction is 10 bits long. 
What is the binary value in the instruction? 
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11.7 


11.8 


11.9 


11.10 


11.11 


11.12 


11.13 


11.14 


11.15 


How many times does the processor need to refer to memory when it fetches and 
executes an indirect-address-mode instruction if the instruction is (a) a computation 
requiring a single operand; (b) a branch? 
The IBM 370 does not provide indirect addressing. Assume that the address of an 
operand is in main memory. How would you access the operand? 
In [COOK82], the author proposes that the PC-relative addressing modes be elimi- 
nated in favor of other modes, such as the use of a stack. What is the disadvantage of 
this proposal? 
The x86 includes the following instruction: 

IMUL opI1, op2, immediate 


This instruction multiplies op2, which may be either register or memory, by the im- 
mediate operand value, and places the result in op1, which must be a register. There is 
no other three-operand instruction of this sort in the instruction set. What is the pos- 
sible use of such an instruction? (Hint: Consider indexing.) 


Consider a processor that includes a base with indexing addressing mode. Suppose an 
instruction is encountered that employs this addressing mode and specifies a dis- 
placement of 1970, in decimal. Currently the base and index register contain the deci- 
mal numbers 48022 and 8, respectively. What is the address of the operand? 


Define: EA = (X)-+ is the effective address equal to the contents of location X, with 
X incremented by one word length after the effective address is calculated; 
EA = —(X) is the effective address equal to the contents of location X, with X decre- 
mented by one word length before the effective address is calculated; EA = (X)-— is 
the effective address equal to the contents of location X, with X decremented by one 
word length after the effective address is calculated. Consider the following instruc- 
tions, each in the format (Operation Source Operand, Destination Operand), with the 
result of the operation placed in the destination operand. 

a. OP X, (X) 

b. OP (X), (X)+ 
c. OP (X)+, (X) 
d. OP —(X), (X) 
e. OP —(X), (X)+ 
f. OP (X)+, (Xx) 
g. OP (X)—,(X) 
Using X as the stack pointer, which of these instructions can pop the top two elements 
from the stack, perform the designated operation (e.g., ADD source to destination and 
store in destination), and push the result back on the stack? For each such instruction, 
does the stack grow toward memory location 0 or in the opposite direction? 

Assume a stack-oriented processor that includes the stack operations PUSH and 
POP. Arithmetic operations automatically involve the top one or two stack elements. 
Begin with an empty stack. What stack elements remain after the following instruc- 
tions are executed? 


PUSH 4 
PUSH 7 
PUSH 8 
ADD 
PUSH 10 
SUB 

MUL 


Justify the assertion that a 32-bit instruction is probably much less than twice as use- 
ful as a 16-bit instruction. 

Why was IBM’s decision to move from 36 bits to 32 bits per word wrenching, and to 
whom? 


11.16 


11.17 


11.18 


11.19 
11.20 


15 
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Assume an instruction set that uses a fixed 16-bit instruction length. Operand specifiers 
are 6 bits in length. There are K two-operand instructions and L zero-operand instruc- 
tions. What is the maximum number of one-operand instructions that can be supported? 


Design a variable-length opcode to allow all of the following to be encoded in a 36-bit 
instruction: 

e instructions with two 15-bit addresses and one 3-bit register number 

¢ instructions with one 15-bit address and one 3-bit register number 

° instructions with no addresses or registers 


Consider the results of Problem 10.6. Assume that M is a 16-bit memory address and 
that X, Y, and Z are either 16-bit addresses or 4-bit register numbers. The one-address 
machine uses an accumulator, and the two- and three-address machines have 16 reg- 
isters and instructions operating on all combinations of memory locations and regis- 
ters. Assuming 8-bit opcodes and instruction lengths that are multiples of 4 bits, how 
many bits does each machine need to compute X? 


Is there any possible justification for an instruction with two opcodes? 
The 16-bit Zilog Z8001 has the following general instruction format: 


14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 


Operand? Operand 


The mode field specifies how to locate the operands from the operand fields. The w/b 
field is used in certain instructions to specify whether the operands are bytes or 16-bit 
words. The operand I field may (depending on the mode field contents) specify one of 
16 general-purpose registers. The operand 2 field may specify any general-purpose 
registers except register 0. When the operand 2 field is all zeros, each of the original 
opcodes takes on a new meaning. 

a. How many opcodes are provided on the Z8001? 

b. Suggest an efficient way to provide more opcodes and indicate the trade-off 

involved. 
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KEY POINTS 


® A processor includes both user-visible registers and control/status regis- 
ters. The former may be referenced, implicitly or explicitly, in machine in- 
structions. User-visible registers may be general purpose or have a 
special use, such as fixed-point or floating-point numbers, addresses, in- 
dexes, and segment pointers. Control and status registers are used to con- 
trol the operation of the processor. One obvious example is the program 
counter. Another important example is a program status word (PSW) 
that contains a variety of status and condition bits. These include bits to 
reflect the result of the most recent arithmetic operation, interrupt en- 
able bits, and an indicator of whether the processor is executing in super- 
visor or user mode. 


® Processors make use of instruction pipelining to speed up execution. In 
essence, pipelining involves breaking up the instruction cycle into a num- 
ber of separate stages that occur in sequence, such as fetch instruction, de- 
code instruction, determine operand addresses, fetch operands, execute 
instruction, and write operand result. Instructions move through these 
stages, as on an assembly line, so that in principle, each stage can be work- 
ing on a different instruction at the same time. The occurrence of branch- 
es and dependencies between instructions complicates the design and use 
of pipelines. 


This chapter discusses aspects of the processor not yet covered in Part Three and 
sets the stage for the discussion of RISC and superscalar architecture in Chapters 13 
and 14. 

We begin with a summary of processor organization. Registers, which form the 
internal memory of the processor, are then analyzed. We are then in a position to re- 
turn to the discussion (begun in Section 3.2) of the instruction cycle. A description of 
the instruction cycle and a common technique known as instruction pipelining com- 
plete our description. The chapter concludes with an examination of some aspects of 
the x86 and ARM organizations. 


12.1 PROCESSOR ORGANIZATION 


To understand the organization of the processor, let us consider the requirements 
placed on the processor, the things that it must do: 


e Fetch instruction: The processor reads an instruction from memory (register, 
cache, main memory). 


e Interpret instruction: The instruction is decoded to determine what action is 
required. 
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e Fetch data: The execution of an instruction may require reading data from 
memory or an I/O module. 


e Process data: The execution of an instruction may require performing some 
arithmetic or logical operation on data. 


e Write data: The results of an execution may require writing data to memory or 
an I/O module. 


To do these things, it should be clear that the processor needs to store some 
data temporarily. It must remember the location of the last instruction so that it can 
know where to get the next instruction. It needs to store instructions and data tem- 
porarily while an instruction is being executed. In other words, the processor needs 
a small internal memory. 

Figure 12.1 is a simplified view of a processor, indicating its connection to the 
rest of the system via the system bus. A similar interface would be needed for any of 
the interconnection structures described in Chapter 3. The reader will recall that the 
major components of the processor are an arithmetic and logic unit (ALU) and a 
control unit (CU). The ALU does the actual computation or processing of data. The 
control unit controls the movement of data and instructions into and out of the 
processor and controls the operation of the ALU. In addition, the figure shows a 
minimal internal memory, consisting of a set of storage locations, called registers. 

Figure 12.2 is a slightly more detailed view of the processor. The data transfer 
and logic control paths are indicated, including an element labeled internal proces- 
sor bus. This element is needed to transfer data between the various registers and 
the ALU because the ALU in fact operates only on data in the internal processor 
memory. The figure also shows typical basic elements of the ALU. Note the similar- 
ity between the internal structure of the computer as a whole and the internal struc- 
ture of the processor. In both cases, there is a small collection of major elements 
(computer: processor, I/O, memory; processor: control unit, ALU, registers) connected 
by data paths. 


Registers 


ALU 


Control 
unit 


Control Data Address 


bus bus bus 
System 
bus 


Figure 12.1 The CPU with the System Bus 
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Arithmetic and logic unit -<—_—————| 
<_—_> 
~<—— Status flags i 
. Registers 


<—_——> Shifter 


[<> Complementer 


Internal CPU bus 


Arithmetic 
< > and 
Boolean 
logic 
Control 
unit 
Control 
paths 


Figure 12.2. Internal Structure of the CPU 


12.2 REGISTER ORGANIZATION 


As we discussed in Chapter 4, a computer system employs a memory hierarchy. At 
higher levels of the hierarchy, memory is faster, smaller, and more expensive (per 
bit). Within the processor, there is a set of registers that function as a level of mem- 
ory above main memory and cache in the hierarchy. The registers in the processor 
perform two roles: 


e User-visible registers: Enable the machine- or assembly language programmer 
to minimize main memory references by optimizing use of registers. 


e Control and status registers: Used by the control unit to control the operation 
of the processor and by privileged, operating system programs to control the 
execution of programs. 


There is not a clean separation of registers into these two categories. For exam- 
ple, on some machines the program counter is user visible (e.g., x86), but on many it 
is not. For purposes of the following discussion, however, we will use these categories. 


User-Visible Registers 


A user-visible register is one that may be referenced by means of the machine language 
that the processor executes. We can characterize these in the following categories: 


e General purpose 
e Data 
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e Address 
e Condition codes 


General-purpose registers can be assigned to a variety of functions by the pro- 
grammer. Sometimes their use within the instruction set is orthogonal to the opera- 
tion. That is, any general-purpose register can contain the operand for any opcode. 
This provides true general-purpose register use. Often, however, there are restrictions. 
For example, there may be dedicated registers for floating-point and stack operations. 

In some cases, general-purpose registers can be used for addressing functions 
(e.g., register indirect, displacement). In other cases, there is a partial or clean sepa- 
ration between data registers and address registers. Data registers may be used only 
to hold data and cannot be employed in the calculation of an operand address. 
Address registers may themselves be somewhat general purpose, or they may be de- 
voted to a particular addressing mode. Examples include the following: 


e Segment pointers: In a machine with segmented addressing (see Section 8.3), a 
segment register holds the address of the base of the segment. There may be 
multiple registers: for example, one for the operating system and one for the 
current process. 


e Index registers: These are used for indexed addressing and may be autoin- 
dexed. 


e Stack pointer: If there is user-visible stack addressing, then typically there is a 
dedicated register that points to the top of the stack. This allows implicit ad- 
dressing; that is, push, pop, and other stack instructions need not contain an ex- 
plicit stack operand. 


There are several design issues to be addressed here. An important issue is 
whether to use completely general-purpose registers or to specialize their use. We 
have already touched on this issue in the preceding chapter because it affects in- 
struction set design. With the use of specialized registers, it can generally be implicit 
in the opcode which type of register a certain operand specifier refers to. The 
operand specifier must only identify one of a set of specialized registers rather than 
one out of all the registers, thus saving bits. On the other hand, this specialization 
limits the programmer’s flexibility. 

Another design issue is the number of registers, either general purpose or data 
plus address, to be provided. Again, this affects instruction set design because more reg- 
isters require more operand specifier bits. As we previously discussed, somewhere be- 
tween 8 and 32 registers appears optimum [LUND77]. Fewer registers result in more 
memory references; more registers do not noticeably reduce memory references (e.g,, 
see [WILL90]). However, a new approach, which finds advantage in the use of hun- 
dreds of registers, is exhibited in some RISC systems and is discussed in Chapter 13. 

Finally, there is the issue of register length. Registers that must hold addresses 
obviously must be at least long enough to hold the largest address. Data registers 
should be able to hold values of most data types. Some machines allow two contigu- 
ous registers to be used as one for holding double-length values. 

A final category of registers, which is at least partially visible to the user, holds 
condition codes (also referred to as flags). Condition codes are bits set by the 
processor hardware as the result of operations. For example, an arithmetic operation 
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Table 12.1 Condition Codes 


Advantages Disadvantages 


1. Because condition codes are set by normal . Condition codes add complexity, both to the 
arithmetic and data movement instructions, hardware and software. Condition code bits 
they should reduce the number of COM- are often modified in different ways by 
PARE and TEST instructions needed. different instructions, making life more 


. Conditional instructions, such as BRANCH difficult for both the microprogrammer and 
are simplified relative to composite instruc- compiler writer. 
tions, such as TEST AND BRANCH. . Condition codes are irregular; they are typi- 

cally not part of the main data path, so they 

. Condition codes facilitate multiway branch- require extra hardware connections. 
es. For example, a TEST instruction can be 
followed by two branches, one on less than 
or equal to zero and one on greater than 
zero. 


. Often condition code machines must add spe- 
cial non-condition-code instructions for special 
situations anyway, such as bit checking, loop 
control, and atomic semaphore operations. 


. Ina pipelined implementation, condition 
codes require special synchronization to 
avoid conflicts. 


may produce a positive, negative, zero, or overflow result. In addition to the result it- 
self being stored in a register or memory, a condition code is also set. The code may 
subsequently be tested as part of a conditional branch operation. 

Condition code bits are collected into one or more registers. Usually, they 
form part of a control register. Generally, machine instructions allow these bits to be 
read by implicit reference, but the programmer cannot alter them. 

Many processors, including those based on the IA-64 architecture and the 
MIPS processors, do not use condition codes at all. Rather, conditional branch in- 
structions specify a comparison to be made and act on the result of the comparison, 
without storing a condition code. Table 12.1, based on [DERO87], lists key advan- 
tages and disadvantages of condition codes. 

In some machines, a subroutine call will result in the automatic saving of all 
user-visible registers, to be restored on return. The processor performs the saving 
and restoring as part of the execution of call and return instructions. This allows 
each subroutine to use the user-visible registers independently. On other machines, 
it is the responsibility of the programmer to save the contents of the relevant user- 
visible registers prior to a subroutine call, by including instructions for this purpose 
in the program. 


Control and Status Registers 


There are a variety of processor registers that are employed to control the operation 
of the processor. Most of these, on most machines, are not visible to the user. Some 
of them may be visible to machine instructions executed in a control or operating 
system mode. 

Of course, different machines will have different register organizations and 
use different terminology. We list here a reasonably complete list of register types, 
with a brief description. 
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Four registers are essential to instruction execution: 


e Program counter (PC): Contains the address of an instruction to be fetched 
e Instruction register (IR): Contains the instruction most recently fetched 
e Memory address register (MAR): Contains the address of a location in memory 


e Memory buffer register (MBR): Contains a word of data to be written to 
memory or the word most recently read 


Not all processors have internal registers designated as MAR and MBR, but 
some equivalent buffering mechanism is needed whereby the bits to be trans- 
ferred to the system bus are staged and the bits to be read from the data bus are 
temporarily stored. 

Typically, the processor updates the PC after each instruction fetch so that the 
PC always points to the next instruction to be executed. A branch or skip instruction 
will also modify the contents of the PC. The fetched instruction is loaded into an IR, 
where the opcode and operand specifiers are analyzed. Data are exchanged with 
memory using the MAR and MBR. In a bus-organized system, the MAR connects 
directly to the address bus, and the MBR connects directly to the data bus. User- 
visible registers, in turn, exchange data with the MBR. 

The four registers just mentioned are used for the movement of data between 
the processor and memory. Within the processor, data must be presented to the 
ALU for processing. The ALU may have direct access to the MBR and user-visible 
registers. Alternatively, there may be additional buffering registers at the boundary 
to the ALU; these registers serve as input and output registers for the ALU and ex- 
change data with the MBR and user-visible registers. 

Many processor designs include a register or set of registers, often known as 
the program status word (PSW), that contain status information. The PSW typically 
contains condition codes plus other status information. Common fields or flags in- 
clude the following: 


e Sign: Contains the sign bit of the result of the last arithmetic operation. 


e Zero: Set when the result is 0. 


e Carry: Set if an operation resulted in a carry (addition) into or borrow (sub- 
traction) out of a high-order bit. Used for multiword arithmetic operations. 


e Equal: Set if a logical compare result is equality. 
e Overflow: Used to indicate arithmetic overflow. 
e Interrupt Enable/Disable: Used to enable or disable interrupts. 


e Supervisor: Indicates whether the processor is executing in supervisor or 
user mode. Certain privileged instructions can be executed only in supervi- 
sor mode, and certain areas of memory can be accessed only in supervisor 
mode. 


A number of other registers related to status and control might be found in a 
particular processor design. There may be a pointer to a block of memory contain- 
ing additional status information (e.g., process control blocks). In machines using 
vectored interrupts, an interrupt vector register may be provided. If a stack is used 
to implement certain functions (e.g., subroutine call), then a system stack pointer is 


DO 
D1 
D2 
D3 
D4 
D5 
D6 
D7 


AO 
Al 
A2 
A3 
A4 
AS 
A6 
AT 


12.2 / REGISTER ORGANIZATION 439 


needed. A page table pointer is used with a virtual memory system. Finally, registers 
may be used in the control of I/O operations. 

A number of factors go into the design of the control and status register orga- 
nization. One key issue is operating system support. Certain types of control infor- 
mation are of specific utility to the operating system. If the processor designer has a 
functional understanding of the operating system to be used, then the register orga- 
nization can to some extent be tailored to the operating system. 

Another key design decision is the allocation of control information between 
registers and memory. It is common to dedicate the first (lowest) few hundred or 
thousand words of memory for control purposes. The designer must decide how 
much control information should be in registers and how much in memory. The 
usual trade-off of cost versus speed arises. 


Example Microprocessor Register Organizations 


It is instructive to examine and compare the register organization of comparable 
systems. In this section, we look at two 16-bit microprocessors that were designed at 
about the same time: the Motorola MC68000 [STRI79] and the Intel 8086 
[MORS78]. Figures 12.3a and b depict the register organization of each; purely in- 
ternal registers, such as a memory address register, are not shown. 

The MC68000 partitions its 32-bit registers into eight data registers and nine ad- 
dress registers. The eight data registers are used primarily for data manipulation and 
are also used in addressing as index registers. The width of the registers allows 8-, 16-, 
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Figure 12.3 Example Microprocessor Register Organizations 


440 CHAPTER 12 / PROCESSOR STRUCTURE AND FUNCTION 


and 32-bit data operations, determined by opcode. The address registers contain 32-bit 
(no segmentation) addresses; two of these registers are also used as stack pointers, one 
for users and one for the operating system, depending on the current execution mode. 
Both registers are numbered 7, because only one can be used at a time. The MC68000 
also includes a 32-bit program counter and a 16-bit status register. 

The Motorola team wanted a very regular instruction set, with no special- 
purpose registers. A concern for code efficiency led them to divide the registers into 
two functional components, saving one bit on each register specifier. This seems a 
reasonable compromise between complete generality and code compaction. 

The Intel 8086 takes a different approach to register organization. Every regis- 
ter is special purpose, although some registers are also usable as general purpose. 
The 8086 contains four 16-bit data registers that are addressable on a byte or 16-bit 
basis, and four 16-bit pointer and index registers. The data registers can be used as 
general purpose in some instructions. In others, the registers are used implicitly. For 
example, a multiply instruction always uses the accumulator. The four pointer regis- 
ters are also used implicitly in a number of operations; each contains a segment off- 
set. There are also four 16-bit segment registers. Three of the four segment registers 
are used in a dedicated, implicit fashion, to point to the segment of the current in- 
struction (useful for branch instructions), a segment containing data, and a segment 
containing a stack, respectively. These dedicated and implicit uses provide for com- 
pact encoding at the cost of reduced flexibility. The 8086 also includes an instruction 
pointer and a set of 1-bit status and control flags. 

The point of this comparison should be clear. There is no universally accepted 
philosophy concerning the best way to organize processor registers [TOON81]. As 
with overall instruction set design and so many other processor design issues, it is 
still a matter of judgment and taste. 

A second instructive point concerning register organization design is illustrated 
in Figure 12.3c. This figure shows the user-visible register organization for the Intel 
80386 [ELAY85], which is a 32-bit microprocessor designed as an extension of the 
8086.' The 80386 uses 32-bit registers. However, to provide upward compatibility for 
programs written on the earlier machine, the 80386 retains the original register organi- 
zation embedded in the new organization. Given this design constraint, the architects 
of the 32-bit processors had limited flexibility in designing the register organization. 


12.3 INSTRUCTION CYCLE 


In Section 3.2, we described the processor’s instruction cycle (Figure 3.9). To recall, 
an instruction cycle includes the following stages: 

e Fetch: Read the next instruction from memory into the processor. 

e Execute: Interpret the opcode and perform the indicated operation. 


e Interrupt: If interrupts are enabled and an interrupt has occurred, save the 
current process state and service the interrupt. 


‘Because the MC68000 already uses 32-bit registers, the MC68020 [MACD84], which is a full 32-bit 
architecture, uses the same register organization. 
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Fetch 


Interrupt Indirect 


Execute 


Figure 12.4 The Instruction Cycle 


We are now in a position to elaborate somewhat on the instruction cycle. First, 
we must introduce one additional stage, known as the indirect cycle. 


The Indirect Cycle 


We have seen, in Chapter 11, that the execution of an instruction may involve one or 
more operands in memory, each of which requires a memory access. Further, if indi- 
rect addressing is used, then additional memory accesses are required. 

We can think of the fetching of indirect addresses as one more instruction 
stages. The result is shown in Figure 12.4. The main line of activity consists of alter- 
nating instruction fetch and instruction execution activities. After an instruction is 
fetched, it is examined to determine if any indirect addressing is involved. If so, the 
required operands are fetched using indirect addressing. Following execution, an in- 
terrupt may be processed before the next instruction fetch. 

Another way to view this process is shown in Figure 12.5, which is a revised 
version of Figure 3.12. This illustrates more correctly the nature of the instruction 
cycle. Once an instruction is fetched, its operand specifiers must be identified. Each 
input operand in memory is then fetched, and this process may require indirect ad- 
dressing. Register-based operands need not be fetched. Once the opcode is executed, 
a similar process may be needed to store the result in main memory. 


Data Flow 


The exact sequence of events during an instruction cycle depends on the design 
of the processor. We can, however, indicate in general terms what must happen. 
Let us assume that a processor that employs a memory address register (MAR), 
a memory buffer register (MBR), a program counter (PC), and an instruction 
register (IR). 

During the fetch cycle, an instruction is read from memory. Figure 12.6 shows 
the flow of data during this cycle. The PC contains the address of the next instruc- 
tion to be fetched. This address is moved to the MAR and placed on the address bus. 
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Figure 12.5 Instruction Cycle State Diagram 
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Figure 12.6 Data Flow, Fetch Cycle 


The control unit requests a memory read, and the result is placed on the data bus 
and copied into the MBR and then moved to the IR. Meanwhile, the PC is incre- 
mented by 1, preparatory for the next fetch. 

Once the fetch cycle is over, the control unit examines the contents of the IR 
to determine if it contains an operand specifier using indirect addressing. If so, an 
indirect cycle is performed. As shown in Figure 12.7, this is a simple cycle. The right- 
most N bits of the MBR, which contain the address reference, are transferred to the 
MAR. Then the control unit requests a memory read, to get the desired address of 
the operand into the MBR. 

The fetch and indirect cycles are simple and predictable. The execute cycle 
takes many forms; the form depends on which of the various machine instructions is 
in the IR. This cycle may involve transferring data among registers, read or write 
from memory or I/O, and/or the invocation of the ALU. 
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Figure 12.7 Data Flow, Indirect Cycle 
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Figure 12.8 Data Flow, Interrupt Cycle 


Like the fetch and indirect cycles, the interrupt cycle is simple and predictable 
(Figure 12.8). The current contents of the PC must be saved so that the processor 
can resume normal activity after the interrupt. Thus, the contents of the PC are 
transferred to the MBR to be written into memory. The special memory location 
reserved for this purpose is loaded into the MAR from the control unit. It might, 
for example, be a stack pointer. The PC is loaded with the address of the interrupt 
routine. As a result, the next instruction cycle will begin by fetching the appropriate 
instruction. 


12.4 INSTRUCTION PIPELINING 


As computer systems evolve, greater performance can be achieved by taking advan- 
tage of improvements in technology, such as faster circuitry. In addition, organiza- 
tional enhancements to the processor can improve performance. We have already 
seen some examples of this, such as the use of multiple registers rather than a single 
accumulator, and the use of a cache memory. Another organizational approach, 
which is quite common, is instruction pipelining. 


Pipelining Strategy 


Instruction pipelining is similar to the use of an assembly line in a manufacturing 
plant. An assembly line takes advantage of the fact that a product goes through var- 
ious stages of production. By laying the production process out in an assembly line, 
products at various stages can be worked on simultaneously. This process is also re- 
ferred to as pipelining, because, as in a pipeline, new inputs are accepted at one end 
before previously accepted inputs appear as outputs at the other end. 

To apply this concept to instruction execution, we must recognize that, in fact, 
an instruction has a number of stages. Figures 12.5, for example, breaks the instruc- 
tion cycle up into 10 tasks, which occur in sequence. Clearly, there should be some 
opportunity for pipelining. 
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Figure 12.9 Two-Stage Instruction Pipeline 


As a simple approach, consider subdividing instruction processing into two 
stages: fetch instruction and execute instruction. There are times during the execu- 
tion of an instruction when main memory is not being accessed. This time could be 
used to fetch the next instruction in parallel with the execution of the current one. 
Figure 12.9a depicts this approach. The pipeline has two independent stages. The 
first stage fetches an instruction and buffers it. When the second stage is free, the first 
stage passes it the buffered instruction. While the second stage is executing the in- 
struction, the first stage takes advantage of any unused memory cycles to fetch and 
buffer the next instruction. This is called instruction prefetch or fetch overlap. Note 
that this approach, which involves instruction buffering, requires more registers. In 
general, pipelining requires registers to store data between stages. 

It should be clear that this process will speed up instruction execution. If the 
fetch and execute stages were of equal duration, the instruction cycle time would be 
halved. However, if we look more closely at this pipeline (Figure 12.9b), we will see 
that this doubling of execution rate is unlikely for two reasons: 


1. The execution time will generally be longer than the fetch time. Execution will 
involve reading and storing operands and the performance of some operation. 
Thus, the fetch stage may have to wait for some time before it can empty its 
buffer. 


. A conditional branch instruction makes the address of the next instruction to 
be fetched unknown. Thus, the fetch stage must wait until it receives the next 
instruction address from the execute stage. The execute stage may then have 
to wait while the next instruction is fetched. 


nN 


Guessing can reduce the time loss from the second reason. A simple rule is the fol- 
lowing: When a conditional branch instruction is passed on from the fetch to the ex- 
ecute stage, the fetch stage fetches the next instruction in memory after the branch 
instruction. Then, if the branch is not taken, no time is lost. If the branch is taken, the 
fetched instruction must be discarded and a new instruction fetched. 
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While these factors reduce the potential effectiveness of the two-stage pipeline, 


some speedup occurs. To gain further speedup, the pipeline must have more stages. 
Let us consider the following decomposition of the instruction processing. 


Fetch instruction (FI): Read the next expected instruction into a buffer. 
Decode instruction (DI): Determine the opcode and the operand specifiers. 


Calculate operands (CO): Calculate the effective address of each source 
operand. This may involve displacement, register indirect, indirect, or other 
forms of address calculation. 


Fetch operands (FO): Fetch each operand from memory. Operands in regis- 
ters need not be fetched. 


Execute instruction (EI): Perform the indicated operation and store the result, 
if any, in the specified destination operand location. 


Write operand (WO): Store the result in memory. 


With this decomposition, the various stages will be of more nearly equal dura- 


tion. For the sake of illustration, let us assume equal duration. Using this assump- 
tion, Figure 12.10 shows that a six-stage pipeline can reduce the execution time for 9 
instructions from 54 time units to 14 time units. 


Several comments are in order: The diagram assumes that each instruction 


goes through all six stages of the pipeline. This will not always be the case. For ex- 
ample, a load instruction does not need the WO stage. However, to simplify the 
pipeline hardware, the timing is set up assuming that each instruction requires all six 
stages. Also, the diagram assumes that all of the stages can be performed in parallel. 
In particular, it is assumed that there are no memory conflicts. For example, the FI, 


1 2 3 4 5 6 7 8 9 | 10 | 11 | 12 | 13 | 14 
Instruction 1 | FI | DI | CO| FO | EI | WoO 
Instruction 2 FI | DI | CO| FO| EI | WO 
Instruction 3 FI | DI | CO} FO| EI | WO 
Instruction 4 FI | DI | CO} FO| EI |} WO 
Instruction 5 FI | DI | CO| FO} EI | WO 
Instruction 6 FI | DI | CO} FO| EI | WO 
Instruction 7 FI | DI | CO} FO| EI | WO 
Instruction 8 FI | DI | CO| FO| EI | Wo 
Instruction 9 FI | DI | CO| FO | EI | Wo 


Time 


Figure 12.10 Timing Diagram for Instruction Pipeline Operation 
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FO, and WO stages involve a memory access. The diagram implies that all these ac- 
cesses can occur simultaneously. Most memory systems will not permit that. Howev- 
er, the desired value may be in cache, or the FO or WO stage may be null. Thus, 
much of the time, memory conflicts will not slow down the pipeline. 

Several other factors serve to limit the performance enhancement. If the six 
stages are not of equal duration, there will be some waiting involved at various 
pipeline stages, as discussed before for the two-stage pipeline. Another difficulty is 
the conditional branch instruction, which can invalidate several instruction fetches. 
A similar unpredictable event is an interrupt. Figure 12.11 illustrates the effects of 
the conditional branch, using the same program as Figure 12.10. Assume that in- 
struction 3 is a conditional branch to instruction 15. Until the instruction is execut- 
ed, there is no way of knowing which instruction will come next. The pipeline, in this 
example, simply loads the next instruction in sequence (instruction 4) and proceeds. 
In Figure 12.10, the branch is not taken, and we get the full performance benefit of 
the enhancement. In Figure 12.11, the branch is taken. This is not determined until 
the end of time unit 7. At this point, the pipeline must be cleared of instructions that 
are not useful. During time unit 8, instruction 15 enters the pipeline. No instructions 
complete during time units 9 through 12; this is the performance penalty incurred 
because we could not anticipate the branch. Figure 12.12 indicates the logic needed 
for pipelining to account for branches and interrupts. 

Other problems arise that did not appear in our simple two-stage organization. 
The CO stage may depend on the contents of a register that could be altered by a 
previous instruction that is still in the pipeline. Other such register and memory con- 
flicts could occur. The system must contain logic to account for this type of conflict. 

To clarify pipeline operation, it might be useful to look at an alternative depic- 
tion. Figures 12.10 and 12.11 show the progression of time horizontally across the 


Time Branch penalty 
1 2 3 4 5 6 7 8 9 | 10 | 11 | 12 | 13 | 14 

Instruction 1 | FI | DI | CO| FO | EI | WO 

Instruction 2 FI | DI | CO; FO | EI | WO 

Instruction 3 FI | DI | CO; FO | EI | WO 

Instruction 4 FI | DI | CO | FO 

Instruction 5 FI | DI | CO 

Instruction 6 FI | DI 

Instruction 7 FI 
Instruction 15 FI | DI | CO| FO| EI 
Instruction 16 FI | DI | CO; FO} EI | WO 


Figure 12.11 The Effect of a Conditional Branch on Instruction Pipeline Operation 
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Figure 12.12 Six-Stage CPU Instruction Pipeline 


figures, with each row showing the progress of an individual instruction. Figure 12.13 
shows same sequence of events, with time progressing vertically down the figure, 
and each row showing the state of the pipeline at a given point in time. In Figure 
12.13a (which corresponds to Figure 12.10), the pipeline is full at time 6, with 6 dif- 
ferent instructions in various stages of execution, and remains full through time 9; 
we assume that instruction I9 is the last instruction to be executed. In Figure 12.13b, 
(which corresponds to Figure 12.11), the pipeline is full at times 6 and 7. At time 7, 
instruction 3 is in the execute stage and executes a branch to instruction 15. At this 
point, instructions I4 through I7 are flushed from the pipeline, so that at time 8, only 


two instructions are in the pipeline, I3 and I15. 


FI | DI |CO|FO| EI |wo 
1 |i 
2/p\u 
3/1B/R/u 
4/4/B/R]u 
515/4/B]2] 1 

»| © [Ie | 15 | 14] 13) 2) 0 

é 7 B| 2 
8 14| 2B 
9 15 | 14 
10 19 | 18 | 17 | 16 | 15 
11 19 | 18 | 17 | 16 
12 19 | 18 | 17 
13 19 | 18 
14 19 


(a) No branches 


12.4 / INSTRUCTION PIPELINING 


Se fF NDAD HN Fe WD YO = 


we Se 
-_ 28 Nn -& S&S 


Figure 12.13 An Alternative Pipeline Depiction 
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From the preceding discussion, it might appear that the greater the number of 
stages in the pipeline, the faster the execution rate. Some of the IBM S/360 design- 
ers pointed out two factors that frustrate this seemingly simple pattern for high- 
performance design [ANDE67a], and they remain elements that designer must still 


consider: 


1. At each stage of the pipeline, there is some overhead involved in moving data 
from buffer to buffer and in performing various preparation and delivery 
functions. This overhead can appreciably lengthen the total execution time of 
a single instruction. This is significant when sequential instructions are logical- 
ly dependent, either through heavy use of branching or through memory ac- 
cess dependencies. 


2. The amount of control logic required to handle memory and register depen- 
dencies and to optimize the use of the pipeline increases enormously with the 
number of stages. This can lead to a situation where the logic controlling the 


gating between stages is more complex than the stages being controlled. 


Another consideration is latching delay: It takes time for pipeline buffers to 
operate and this adds to instruction cycle time. 
Instruction pipelining is a powerful technique for enhancing performance but 


requires careful design to achieve optimum results with reasonable complexity. 
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Pipeline Performance 


In this subsection, we develop some simple measures of pipeline performance and 
relative speedup (based on a discussion in [HWAN93]). The cycle time 7 of an in- 
struction pipeline is the time needed to advance a set of instructions one stage 
through the pipeline; each column in Figures 12.10 and 12.11 represents one cycle 
time. The cycle time can be determined as 

7=max[7)+d=7, +d 15isk 
where 


7; = time delay of the circuitry in the ith stage of the pipeline 
Tm = Maximum stage delay (delay through stage which experiences the 


largest delay) 

k = number of stages in the instruction pipeline 

d = time delay ofa latch, needed to advance signals and data from one stage 
to the next 


In general, the time delay d is equivalent to a clock pulse and 7,, => d. Now 
suppose that n instructions are processed, with no branches. Let T,,, be the total 
time required for a pipeline with k stages to execute n instructions. Then 


Typ = [k+(n-1Jr (12.1) 


A total of k cycles are required to complete the execution of the first instruc- 
tion, and the remaining n — 1 instructions require n — 1 cycles.” This equation is 
easily verified from Figures 12.10. The ninth instruction completes at time cycle 14: 


14=[6+ (9-1) 


Now consider a processor with equivalent functions but no pipeline, and as- 
sume that the instruction cycle time is kr. The speedup factor for the instruction 
pipeline compared to execution without the pipeline is defined as 


Tin nkt nk 
Tee ke Pre De Re Gea) 


Sx (12.2) 

Figure 12.14a plots the speedup factor as a function of the number of instruc- 
tions that are executed without a branch. As might be expected, at the limit (7 — ©0), 
we have a k-fold speedup. Figure 12.14b shows the speedup factor as a function of 
the number of stages in the instruction pipeline.* In this case, the speedup factor ap- 
proaches the number of instructions that can be fed into the pipeline without 
branches. Thus, the larger the number of pipeline stages, the greater the potential for 
speedup. However, as a practical matter, the potential gains of additional pipeline 
stages are countered by increases in cost, delays between stages, and the fact that 
branches will be encountered requiring the flushing of the pipeline. 


?We are being a bit sloppy here. The cycle time will only equal the maximum value of 7 when all the 
stages are full. At the beginning, the cycle time may be less for the first one or few cycles. 


3Note that the x-axis is logarithmic in Figure 12.14a and linear in Figure 12.14b. 
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Figure 12.14 Speedup Factors with Instruction Pipelining 


Pipeline Hazards 


In the previous subsection, we mentioned some of the situations that can result in 
less than optimal pipeline performance. In this subsection, we examine this issue in 
a more systematic way. Chapter 14 revisits this issue, in more detail, after we have in- 
troduced the complexities found in superscalar pipeline organizations. 

A pipeline hazard occurs when the pipeline, or some portion of the pipeline, 
must stall because conditions do not permit continued execution. Such a pipeline 
stall is also referred to as a pipeline bubble. There are three types of hazards: 
resource, data, and control. 


RESOURCE HAZARDS A resource hazard occurs when two (or more) instructions 
that are already in the pipeline need the same resource. The result is that the in- 
structions must be executed in serial rather than parallel for a portion of the 
pipeline. A resource hazard is sometime referred to as a structural hazard. 

Let us consider a simple example of a resource hazard. Assume a simplified five- 
stage pipeline, in which each stage takes one clock cycle. Figure 12.15a shows the ideal 
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Clock cycle 
1 2 3 4 5 6 7 8 9 
Ii | FI | DI | FO| EI | Wo 
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(a) Five-stage pipeline, ideal case 
Clock cycle 
1 2 3 4 5 6 7 8 9 
Ii | FI | DI | FO| EI | Wo 
B 
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s 
i= 
2 2B Idle| FI | DI | FO | EI | wo 
| 
14 FI | DI | FO | EI | WO 


(b) I1 source operand in memory 


Figure 12.15 Example of Resource Hazard 


case, in which a new instruction enters the pipeline each clock cycle. Now assume that 
main memory has a single port and that all instruction fetches and data reads and 
writes must be performed one at a time. Further, ignore the cache. In this case, an 
operand read to or write from memory cannot be performed in parallel with an in- 
struction fetch. This is illustrated in Figure 12.15b, which assumes that the source 
operand for instruction I1 is in memory, rather than a register. Therefore, the fetch in- 
struction stage of the pipeline must idle for one cycle before beginning the instruction 
fetch for instruction I3. The figure assumes that all other operands are in registers. 

Another example of a resource conflict is a situation in which multiple in- 
structions are ready to enter the execute instruction phase and there is a single 
ALU. One solutions to such resource hazards is to increase available resources, such 
as having multiple ports into main memory and multiple ALU units. 


Reservation Table Analyzer 


One approach to analyzing resource conflicts and aiding in the design of 
pipelines is the reservation table. We examine reservation tables in Appendix I. 
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DATA HAZARDS A data hazard occurs when there is a conflict in the access of an 
operand location. In general terms, we can state the hazard in this form: Two in- 
structions in a program are to be executed in sequence and both access a particular 
memory or register operand. If the two instructions are executed in strict sequence, 
no problem occurs. However, if the instructions are executed in a pipeline, then it is 
possible for the operand value to be updated in such a way as to produce a different 
result than would occur with strict sequential execution. In other words, the pro- 
gram produces an incorrect result because of the use of pipelining. 
As an example, consider the following x86 machine instruction sequence: 


ADD EAX, EBX /* BAX = EAX + EBX 
SUB ECX, EAX /* ECX = ECX - EAX 


The first instruction adds the contents of the 32-bit registers EAX and EBX 
and stores the result in EAX. The second instruction subtracts the contents of EAX 
from ECX and stores the result in ECX. Figure 12.16 shows the pipeline behavior. 
The ADD instruction does not update register EAX until the end of stage 5, which 
occurs at clock cycle 5. But the SUB instruction needs that value at the beginning of 
its stage 2, which occurs at clock cycle 4. To maintain correct operation, the pipeline 
must stall for two clocks cycles. Thus, in the absence of special hardware and specif- 
ic avoidance algorithms, such a data hazard results in inefficient pipeline usage. 

There are three types of data hazards; 


e Read after write (RAW), or true dependency: An instruction modifies a regis- 
ter or memory location and a succeeding instruction reads the data in that 
memory or register location. A hazard occurs if the read takes place before the 
write operation is complete. 

e Write after read (RAW), or antidependency: An instruction reads a register or 
memory location and a succeeding instruction writes to the location. A hazard 
occurs if the write operation completes before the read operation takes place. 

e Write after write (RAW), or output dependency: Two instructions both write 
to the same location. A hazard occurs if the write operations take place in the 
reverse order of the intended sequence. 


The example of Figure 12.16 is a RAW hazard. The other two hazards are best 
discussed in the context of superscalar organization, discussed in Chapter 14. 


Clock cycle 
1 2 3 4 5 6 7 8 9 10 


ADD EAX, EBX | FI | DI | FO | EI | WO 


SUB ECX, EAX FI | DI Idle FO | EI | WO 


13 FI DI | FO | EI | WO 


14 FI | DI | FO| EI |WO 


Figure 12.16 Example of Data Hazard 


454 CHAPTER 12 / PROCESSOR STRUCTURE AND FUNCTION 


CONTROL HAZARDS A control hazard, also known as a branch hazard, occurs 
when the pipeline makes the wrong decision on a branch prediction and therefore 
brings instructions into the pipeline that must subsequently be discarded. We discuss 
approaches to dealing with control hazards next. 


Dealing with Branches 


One of the major problems in designing an instruction pipeline is assuring a steady 
flow of instructions to the initial stages of the pipeline. The primary impediment, as 
we have seen, is the conditional branch instruction. Until the instruction is actually 
executed, it is impossible to determine whether the branch will be taken or not. 

A variety of approaches have been taken for dealing with conditional branches: 


e Multiple streams 

e Prefetch branch target 
e Loop buffer 

e Branch prediction 

e Delayed branch 


MULTIPLE STREAMS A simple pipeline suffers a penalty for a branch instruction be- 
cause it must choose one of two instructions to fetch next and may make the wrong 
choice. A brute-force approach is to replicate the initial portions of the pipeline and 
allow the pipeline to fetch both instructions, making use of two streams. There are 
two problems with this approach: 


e With multiple pipelines there are contention delays for access to the registers 
and to memory. 


e Additional branch instructions may enter the pipeline (either stream) before 
the original branch decision is resolved. Each such instruction needs an addi- 
tional stream. 


Despite these drawbacks, this strategy can improve performance. Examples of ma- 
chines with two or more pipeline streams are the IBM 370/168 and the IBM 3033. 


PREFETCH BRANCH TARGET When a conditional branch is recognized, the target 
of the branch is prefetched, in addition to the instruction following the branch. This 
target is then saved until the branch instruction is executed. If the branch is taken, 
the target has already been prefetched. 

The IBM 360/91 uses this approach. 


LOOP BUFFER A loop buffer is a small, very-high-speed memory maintained by the 
instruction fetch stage of the pipeline and containing the n most recently fetched in- 
structions, in sequence. If a branch is to be taken, the hardware first checks whether 
the branch target is within the buffer. If so, the next instruction is fetched from the 
buffer. The loop buffer has three benefits: 


1. With the use of prefetching, the loop buffer will contain some instruction se- 
quentially ahead of the current instruction fetch address. Thus, instructions 
fetched in sequence will be available without the usual memory access time. 
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Figure 12.17 Loop Buffer 


2. If a branch occurs to a target just a few locations ahead of the address of the 
branch instruction, the target will already be in the buffer. This is useful for the 
rather common occurrence of IF-THEN and IF-THEN-ELSE sequences. 


3. This strategy is particularly well suited to dealing with loops, or iterations; 
hence the name Joop buffer. If the loop buffer is large enough to contain all 
the instructions in a loop, then those instructions need to be fetched from 
memory only once, for the first iteration. For subsequent iterations, all the 
needed instructions are already in the buffer. 


The loop buffer is similar in principle to a cache dedicated to instructions. The 
differences are that the loop buffer only retains instructions in sequence and is 
much smaller in size and hence lower in cost. 

Figure 12.17 gives an example of a loop buffer. If the buffer contains 256 bytes, 
and byte addressing is used, then the least significant 8 bits are used to index the 
buffer. The remaining most significant bits are checked to determine if the branch 
target lies within the environment captured by the buffer. 

Among the machines using a loop buffer are some of the CDC machines (Star- 
100, 6600, 7600) and the CRAY-1.A specialized form of loop buffer is available on 
the Motorola 68010, for executing a three-instruction loop involving the DBcc 
(decrement and branch on condition) instruction (see Problem 12.14). A three-word 
buffer is maintained, and the processor executes these instructions repeatedly until 
the loop condition is satisfied. 


Branch Prediction Simulator 
Branch Target Buffer 


BRANCH PREDICTION Various techniques can be used to predict whether a branch 
will be taken. Among the more common are the following: 


e Predict never taken 


e Predict always taken 
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e Predict by opcode 
e Taken/not taken switch 
e Branch history table 


The first three approaches are static: they do not depend on the execution his- 
tory up to the time of the conditional branch instruction. The latter two approaches 
are dynamic: They depend on the execution history. 

The first two approaches are the simplest. These either always assume that the 
branch will not be taken and continue to fetch instructions in sequence, or they 
always assume that the branch will be taken and always fetch from the branch tar- 
get. The predict-never-taken approach is the most popular of all the branch predic- 
tion methods. 

Studies analyzing program behavior have shown that conditional branches are 
taken more than 50% of the time [LILJ88], and so if the cost of prefetching from 
either path is the same, then always prefetching from the branch target address 
should give better performance than always prefetching from the sequential path. 
However, in a paged machine, prefetching the branch target is more likely to cause 
a page fault than prefetching the next instruction in sequence, and so this perfor- 
mance penalty should be taken into account. An avoidance mechanism may be em- 
ployed to reduce this penalty. 

The final static approach makes the decision based on the opcode of the 
branch instruction. The processor assumes that the branch will be taken for certain 
branch opcodes and not for others. [LILJ88] reports success rates of greater than 
75% with this strategy. 

Dynamic branch strategies attempt to improve the accuracy of prediction by 
recording the history of conditional branch instructions in a program. For example, 
one or more bits can be associated with each conditional branch instruction that re- 
flect the recent history of the instruction. These bits are referred to as a taken/not 
taken switch that directs the processor to make a particular decision the next time 
the instruction is encountered. Typically, these history bits are not associated with 
the instruction in main memory. Rather, they are kept in temporary high-speed stor- 
age. One possibility is to associate these bits with any conditional branch instruction 
that is in a cache. When the instruction is replaced in the cache, its history is lost. An- 
other possibility is to maintain a small table for recently executed branch instruc- 
tions with one or more history bits in each entry. The processor could access the 
table associatively, like a cache, or by using the low-order bits of the branch instruc- 
tion’s address. 

With a single bit, all that can be recorded is whether the last execution of this 
instruction resulted in a branch or not. A shortcoming of using a single bit appears 
in the case of a conditional branch instruction that is almost always taken, such as a 
loop instruction. With only one bit of history, an error in prediction will occur twice 
for each use of the loop: once on entering the loop, and once on exiting. 

If two bits are used, they can be used to record the result of the last two instances 
of the execution of the associated instruction, or to record a state in some other fash- 
ion. Figure 12.18 shows a typical approach (see Problem 12.13 for other possibilities). 
Assume that the algorithm starts at the upper-left-hand corner of the flowchart. As 
long as each succeeding conditional branch instruction that is encountered is taken, 


Read next 
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Figure 12.18 Branch Prediction Flowchart 


the decision process predicts that the next branch will be taken. If a single predic- 
tion is wrong, the algorithm continues to predict that the next branch is taken. Only 
if two successive branches are not taken does the algorithm shift to the right-hand 
side of the flowchart. Subsequently, the algorithm will predict that branches are not 
taken until two branches in a row are taken. Thus, the algorithm requires two con- 
secutive wrong predictions to change the prediction decision. 

The decision process can be represented more compactly by a finite-state ma- 
chine, shown in Figure 12.19. The finite-state machine representation is commonly 
used in the literature. 

The use of history bits, as just described, has one drawback: If the decision is 
made to take the branch, the target instruction cannot be fetched until the target ad- 
dress, which is an operand in the conditional branch instruction, is decoded. Greater 
efficiency could be achieved if the instruction fetch could be initiated as soon as the 
branch decision is made. For this purpose, more information must be saved, in what 
is known as a branch target buffer, or a branch history table. 

The branch history table is a small cache memory associated with the instruc- 
tion fetch stage of the pipeline. Each entry in the table consists of three elements: 
the address of a branch instruction, some number of history bits that record the 
state of use of that instruction, and information about the target instruction. In 
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Figure 12.19 Branch Prediction State Diagram 


most proposals and implementations, this third field contains the address of the tar- 
get instruction. Another possibility is for the third field to actually contain the tar- 
get instruction. The trade-off is clear: Storing the target address yields a smaller 
table but a greater instruction fetch time compared with storing the target instruc- 
tion [RECH98]. 

Figure 12.20 contrasts this scheme with a predict-never-taken strategy. With 
the former strategy, the instruction fetch stage always fetches the next sequential ad- 
dress. If a branch is taken, some logic in the processor detects this and instructs that 
the next instruction be fetched from the target address (in addition to flushing the 
pipeline). The branch history table is treated as a cache. Each prefetch triggers a 
lookup in the branch history table. If no match is found, the next sequential address 
is used for the fetch. If a match is found, a prediction is made based on the state of 
the instruction: Either the next sequential address or the branch target address is 
fed to the select logic. 

When the branch instruction is executed, the execute stage signals the branch 
history table logic with the result. The state of the instruction is updated to reflect a 
correct or incorrect prediction. If the prediction is incorrect, the select logic is redi- 
rected to the correct address for the next fetch. When a conditional branch instruc- 
tion is encountered that is not in the table, it is added to the table and one of the 
existing entries is discarded, using one of the cache replacement algorithms dis- 
cussed in Chapter 4. 

A refined of the branch history approach is referred to as two-level or correlation- 
based branch history [YEH91]. This approach is based on the assumption that 
whereas in loop-closing branches, the past history of a particular branch instruction 
is a good predictor of future behavior, with more complex control-flow structures, 
the direction of a branch is frequently correlated with the direction of related 
branches. An example is an if-then-else or case structure. There are a number of 
strategies possible. Typically, recent global branch history (ie., the history of the 
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Figure 12.20 Dealing with Branches 


most recent branches not just of this branch instruction) is used in addition to the 
history of the current branch instruction. The general structure is defined as an 
(m,n) correlator, which uses the behavior of the last m branches to choose from 2” 
n-bit branch predictors for the current branch instruction. In other words, an n-bit 
history is kept for a give branch for each possible combination of branches taken by 
the most recent m branches. 


DELAYED BRANCH It is possible to improve pipeline performance by automatically 
rearranging instructions within a program, so that branch instructions occur later 
than actually desired. This intriguing approach is examined in Chapter 13. 
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Intel 80486 Pipelining 


An instructive example of an instruction pipeline is that of the Intel 80486. The 
80486 implements a five-stage pipeline: 


e Fetch: Instructions are fetched from the cache or from external memory and 
placed into one of the two 16-byte prefetch buffers. The objective of the fetch 
stage is to fill the prefetch buffers with new data as soon as the old data have 
been consumed by the instruction decoder. Because instructions are of vari- 
able length (from 1 to 11 bytes not counting prefixes), the status of the 
prefetcher relative to the other pipeline stages varies from instruction to in- 
struction. On average, about five instructions are fetched with each 16-byte 
load [CRAW90]. The fetch stage operates independently of the other stages to 
keep the prefetch buffers full. 


e Decode stage 1: All opcode and addressing-mode information is decoded in 
the D1 stage. The required information, as well as instruction-length informa- 
tion, is included in at most the first 3 bytes of the instruction. Hence, 3 bytes 
are passed to the D1 stage from the prefetch buffers. The D1 decoder can then 
direct the D2 stage to capture the rest of the instruction (displacement and im- 
mediate data), which is not involved in the D1 decoding. 


e Decode stage 2: The D2 stage expands each opcode into control signals for the 
ALU It also controls the computation of the more complex addressing modes. 


e Execute: This stage includes ALU operations, cache access, and register up- 
date. 


e Write back: This stage, if needed, updates registers and status flags modified 
during the preceding execute stage. If the current instruction updates memory, 
the computed value is sent to the cache and to the bus-interface write buffers 
at the same time. 


With the use of two decode stages, the pipeline can sustain a throughput of 
close to one instruction per clock cycle. Complex instructions and conditional 
branches can slow down this rate. 

Figure 12.21 shows examples of the operation of the pipeline. Part a shows that 
there is no delay introduced into the pipeline when a memory access is required. 
However, as part b shows, there can be a delay for values used to compute memory 
addresses. That is, if a value is loaded from memory into a register and that register 
is then used as a base register in the next instruction, the processor will stall for one 
cycle. In this example, the processor accesses the cache in the EX stage of the first 
instruction and stores the value retrieved in the register during the WB stage. How- 
ever, the next instruction needs this register in its D2 stage. When the D2 stage lines 
up with the WB stage of the previous instruction, bypass signal paths allow the D2 
stage to have access to the same data being used by the WB stage for writing, saving 
one pipeline stage. 

Figure 12.21c illustrates the timing of a branch instruction, assuming that the 
branch is taken. The compare instruction updates condition codes in the WB stage, 
and bypass paths make this available to the EX stage of the jump instruction at the 
same time. In parallel, the processor runs a speculative fetch cycle to the target of 
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Figure 12.21 80486 Instruction Pipeline Examples 


the jump during the EX stage of the jump instruction. If the processor determines a 
false branch condition, it discards this prefetch and continues execution with the 
next sequential instruction (already fetched and decoded). 


12.5 THE x86 PROCESSOR FAMILY 


The x86 organization has evolved dramatically over the years. In this section we ex- 
amine some of the details of the most recent processor organizations, concentrating 
on common elements in single processors. Chapter 14 looks at superscalar aspects of 
the x86, and Chapter 18 examines the multicore organization. An overview of the 
Pentium 4 processor organization is depicted in Figure 4.18. 


Register Organization 
The register organization includes the following types of registers (Table 12.2): 


e General: There are eight 32-bit general-purpose registers (see Figure 12.3c). 
These may be used for all types of x86 instructions; they can also hold operands 
for address calculations. In addition, some of these registers also serve special 
purposes. For example, string instructions use the contents of the ECX, ESI, 
and EDI registers as operands without having to reference these registers ex- 
plicitly in the instruction. As a result, a number of instructions can be encoded 
more compactly. In 64-bit mode, there are 16 64-bit general-purpose registers. 


e Segment: The six 16-bit segment registers contain segment selectors, which 
index into segment tables, as discussed in Chapter 8. The code segment (CS) 
register references the segment containing the instruction being executed. The 
stack segment (SS) register references the segment containing a user-visible 
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Table 12.2 x86 Processor Registers 


Type 


(a) Integer Unit in 32-bit Mode 


Number Length (bits) Purpose 


General 
Segment 
EFLAGS 


32 General-purpose user registers 
16 Contain segment selectors 


32 Status and control bits 


Instruction Pointer 32 Instruction pointer 


Type 


(b) Integer Unit in 64-bit Mode 
Number Length (bits) Purpose 


General 
Segment 
RFLAGS 


oo General-purpose user registers 
16 Contain segment selectors 


64 Status and control bits 


Instruction Pointer 64 Instruction pointer 


Type 


(c) Floating-Point Unit 


Number Length (bits) Purpose 


Numeric 
Control 
Status 
Tag Word 


80 Hold floating-point numbers 
16 Control bits 
16 Status bits 


16 Specifies contents of numeric registers 


Instruction Pointer 48 Points to instruction interrupted by exception 


Data Pointer 


48 Points to operand interrupted by exception 


stack. The remaining segment registers (DS, ES, FS, GS) enable the user to ref- 
erence up to four separate data segments at a time. 


Flags: The 32-bit EFLAGS register contains condition codes and various 
mode bits. In 64-bit mode, this register is extended to 64 bits and referred to as 
RFLAGS. In the current architecture definition, the upper 32 bits of RFLAGS 
are unused. 


Instruction pointer: Contains the address of the current instruction. 
There are also registers specifically devoted to the floating-point unit: 


Numeric: Each register holds an extended-precision 80-bit floating-point 
number. There are eight registers that function as a stack, with push and pop 
operations available in the instruction set. 


Control: The 16-bit control register contains bits that control the operation of 
the floating-point unit, including the type of rounding control; single, double, or 
extended precision; and bits to enable or disable various exception conditions. 


Status: The 16-bit status register contains bits that reflect the current state of 
the floating-point unit, including a 3-bit pointer to the top of the stack; condi- 
tion codes reporting the outcome of the last operation; and exception flags. 
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e Tag word: This 16-bit register contains a 2-bit tag for each floating-point 
numeric register, which indicates the nature of the contents of the correspond- 
ing register. The four possible values are valid, zero, special (NaN, infinity, de- 
normalized), and empty. These tags enable programs to check the contents of a 
numeric register without performing complex decoding of the actual data in 
the register. For example, when a context switch is made, the processor need 
not save any floating-point registers that are empty. 


The use of most of the aforementioned registers is easily understood. Let us 
elaborate briefly on several of the registers. 


EFLAGS REGISTER The EFLAGS register (Figure 12.22) indicates the condition 
of the processor and helps to control its operation. It includes the six condition 
codes defined in Table 10.9 (carry, parity, auxiliary, zero, sign, overflow), which re- 
port the results of an integer operation. In addition, there are bits in the register that 
may be referred to as control bits: 


e Trap flag (TF): When set, causes an interrupt after the execution of each 
instruction. This is used for debugging. 


e Interrupt enable flag (IF): When set, the processor will recognize external 
interrupts. 


e Direction flag (DF): Determines whether string processing instructions incre- 
ment or decrement the 16-bit half-registers SI and DI (for 16-bit operations) 
or the 32-bit registers ESI and EDI (for 32-bit operations). 


e I/O privilege flag (IOPL): When set, causes the processor to generate an ex- 
ception on all accesses to I/O devices during protected-mode operation. 


e Resume flag (RF): Allows the programmer to disable debug exceptions so 
that the instruction can be restarted after a debug exception without immedi- 
ately causing another debug exception. 


e Alignment check (AC): Activates if a word or doubleword is addressed on a 
nonword or nondoubleword boundary. 


e Identification flag (ID): If this bit can be set and cleared, then this processor 
supports the processorID instruction. This instruction provides information 
about the vendor, family, and model. 


31 21 16/15 0 
SHHEME BESSERREE A] [Ph fc 
Dip C|M|F T| PL |F/F|F/F\F/F F F F 

ID = Identification flag DF = Direction flag 

VIP = Virtual interrupt pending IF = Interrupt enable flag 

VIF = Virtual interrupt flag TF = Trap flag 

AC = Alignment check SF = Sign flag 

VM = Virtual 8086 mode ZF = Zero flag 

RF = Resume flag AF = Auxiliary carry flag 

NT = Nested task flag PF = Parity flag 

IOPL = I/O privilege level CF = Carry flag 

OF = Overflow flag 


Figure 12.22 Pentium II EFLAGS Register 
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In addition, there are 4 bits that relate to operating mode. The Nested Task 
(NT) flag indicates that the current task is nested within another task in protected- 
mode operation. The Virtual Mode (VM) bit allows the programmer to enable or 
disable virtual 8086 mode, which determines whether the processor runs as an 8086 
machine. The Virtual Interrupt Flag (VIF) and Virtual Interrupt Pending (VIP) flag 
are used in a multitasking environment. 


CONTROL REGISTERS The x86 employs four control registers (register CR1 is un- 
used) to control various aspects of processor operation (Figure 12.23). All of the regis- 
ters except CRO are either 32 bits or 64 bits long, depending on whether the 
implementation supports the x86 64-bit architecture. The CRO register contains system 


(63) OSXSAVE OSXMMEXCPT OSFXSR 
S Dana aan ieee aa 6543210 
S|V 
P|P|M|P/P T|P|V 
ne c]G/cja}s|P]s|v [mM] cra 
EIE E/E|JE/E/E D/I/E 
P|P 
Page-directory base C|W CR3 
D/|T (PDBR) 
Page-fault linear address CR2 
CRI 
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 1615 14131211109 8 76543210 
P/C|N A W N|E/T|E/M|P CRO 
G|D|W M P E|T|/S|M/PJE 
Shaded area indicates reserved bits. 
OSXSAVE = XSAVE enable bit PCD = Page-level cache disable 
SMXE = Enable safer mode extensions PWT = Page-level writes transparent 
VMXE = Enable virtual machine extensions PG = Paging 
OSXMMEXCPT = Support unmasked SIMD FP exceptions CD = Cache disable 
OSFXSR = Support FXSAVE, FXSTOR NW = Not write through 
PCE = Performance counter enable AM = Alignment mask 
PGE = Page global enable WP = Write protect 
MCE = Machine check enable NE = Numeric error 
PAE = Physical address extension ET = Extension type 
PSE = Page size extensions TS = Task switched 
DE = Debug extensions EM = Emulation 
TSD = Time stamp disable MP = Monitor coprocessor 
PVI = Protected mode virtual interrupt PE = Protection enable 
VME = Virtual 8086 mode extensions 


Figure 12.23 x86 Control Registers 
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control flags, which control modes or indicate states that apply generally to the proces- 
sor rather than to the execution of an individual task. The flags are are as follows: 


Protection Enable (PE): Enable/disable protected mode of operation. 


Monitor Coprocessor (MP): Only of interest when running programs from 
earlier machines on the x86; it relates to the presence of an arithmetic 
coprocessor. 


Emulation (EM): Set when the processor does not have a floating-point unit, 
and causes an interrupt when an attempt is made to execute floating-point in- 
structions. 


Task Switched (TS): Indicates that the processor has switched tasks. 


Extension Type (ET): Not used on the Pentium and later machines; used to in- 
dicate support of math coprocessor instructions on earlier machines. 


Numeric Error (NE): Enables the standard mechanism for reporting floating- 
point errors on external bus lines. 


Write Protect (WP): When this bit is clear, read-only user-level pages can be 
written by a supervisor process. This feature is useful for supporting process 
creation in some operating systems. 


Alignment Mask (AM): Enables/disables alignment checking. 


Not Write Through (NW): Selects mode of operation of the data cache. When 
this bit is set, the data cache is inhibited from cache write-through operations. 


Cache Disable (CD): Enables/disables the internal cache fill mechanism. 
Paging (PG): Enables/disables paging. 
When paging is enabled, the CR2 and CR3 registers are valid. The CR2 regis- 


ter holds the 32-bit linear address of the last page accessed before a page fault inter- 
rupt. The leftmost 20 bits of CR3 hold the 20 most significant bits of the base address 
of the page directory; the remainder of the address contains zeros. Two bits of CR3 
are used to drive pins that control the operation of an external cache. The page-level 
cache disable (PCD) enables or disables the external cache, and the page-level 
writes transparent (PWT) bit controls write through in the external cache. 


Nine additional control bits are defined in CR4: 
Virtual-8086 Mode Extension (VME): Enables support for the virtual inter- 
rupt flag in virtual-8086 mode. 


Protected-mode Virtual Interrupts (PVI): Enables support for the virtual in- 
terrupt flag in protected mode. 

Time Stamp Disable (TSD): Disables the read from time stamp counter 
(RDTSC) instruction, which is used for debugging purposes. 

Debugging Extensions (DE): Enables I/O breakpoints; this allows the proces- 
sor to interrupt on I/O reads and writes. 

Page Size Extensions (PSE): Enables large page sizes (2 or 4-MByte pages) 
when set; restricts pages to 4 KBytes when clear. 

Physical Address Extension (PAE): Enables address lines A35 through A32 
whenever a special new addressing mode, controlled by the PSE, is enabled. 
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e Machine Check Enable (MCE): Enables the machine check interrupt, which 
occurs when a data parity error occurs during a read bus cycle or when a bus 
cycle is not successfully completed. 


e Page Global Enable (PGE): Enables the use of global pages. When PGE = 1 
and a task switch is performed, all of the TLB entries are flushed with the ex- 
ception of those marked global. 


e Performance Counter Enable (PCE): Enables the execution of the RDPMC 
(read performance counter) instruction at any privilege level. Two perfor- 
mance counters are used to measure the duration of a specific event type and 
the number of occurrences of a specific event type. 


MIMX REGISTERS Recall from Section 10.3 that the x86 MMX capability makes 
use of several 64-bit data types. The MMX instructions make use of 3-bit register ad- 
dress fields, so that eight MMX registers are supported. In fact, the processor does 
not include specific MMX registers. Rather, the processor uses an aliasing technique 
(Figure 12.24). The existing floating-point registers are used to store MMX values. 
Specifically, the low-order 64 bits (mantissa) of each floating-point register are used 
to form the eight MMX registers. Thus, the older 32-bit x86 architecture is easily ex- 
tended to support the MMX capability. Some key characteristics of the MMX use of 
these registers are as follows: 


e Recall that the floating-point registers are treated as a stack for floating-point 
operations. For MMX operations, these same registers are accessed directly. 


e The first time that an MMX instruction is executed after any floating-point op- 
erations, the FP tag word is marked valid. This reflects the change from stack 
operation to direct register addressing. 


Floating-point 
tag Floating-point registers 


MMxX registers 


Figure 12.24 Mapping of MMX Registers to 
Floating-Point Registers 
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e The EMMS (Empty MMxX State) instruction sets bits of the FP tag word to in- 
dicate that all registers are empty. It is important that the programmer insert 
this instruction at the end of an MMX code block so that subsequent floating- 
point operations function properly. 


e When a value is written to an MMX register, bits [79:64] of the corresponding 
FP register (sign and exponent bits) are set to all ones. This sets the value in 
the FP register to NaN (not a number) or infinity when viewed as a floating- 
point value. This ensures that an MMX data value will not look like a valid 
floating-point value. 


Interrupt Processing 


Interrupt processing within a processor is a facility provided to support the operat- 
ing system. It allows an application program to be suspended, in order that a variety 
of interrupt conditions can be serviced and later resumed. 


INTERRUPTS AND EXCEPTIONS Two classes of events cause the x86 to suspend ex- 
ecution of the current instruction stream and respond to the event: interrupts and ex- 
ceptions. In both cases, the processor saves the context of the current process and 
transfers to a predefined routine to service the condition. An interrupt is generated by 
a signal from hardware, and it may occur at random times during the execution of a 
program. An exception is generated from software, and it is provoked by the execution 
of an instruction. There are two sources of interrupts and two sources of exceptions: 


1. Interrupts 


e Maskable interrupts: Received on the processor’s INTR pin. The processor 
does not recognize a maskable interrupt unless the interrupt enable flag 
(IF) is set. 


e Nonmaskable interrupts: Received on the processor’s NMI pin. Recogni- 
tion of such interrupts cannot be prevented. 


2. Exceptions 


e Processor-detected exceptions: Results when the processor encounters an 
error while attempting to execute an instruction. 


e Programmed exceptions: These are instructions that generate an exception 
(e.g., INTO, INT3, INT, and BOUND). 


INTERRUPT VECTOR TABLE Interrupt processing on the x86 uses the interrupt vec- 
tor table. Every type of interrupt is assigned a number, and this number is used to 
index into the interrupt vector table. This table contains 256 32-bit interrupt vectors, 
which is the address (segment and offset) of the interrupt service routine for that in- 
terrupt number. 

Table 12.3 shows the assignment of numbers in the interrupt vector table; 
shaded entries represent interrupts, while nonshaded entries are exceptions. The 
NMI hardware interrupt is type 2. INTR hardware interrupts are assigned numbers 
in the range of 32 to 255; when an INTR interrupt is generated, it must be accompa- 
nied on the bus with the interrupt vector number for this interrupt. The remaining 
vector numbers are used for exceptions. 
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Table 12.3. x86 Exception and Interrupt Vector Table 


Vector Number Description 


Divide error; division overflow or division by zero 

Debug exception; includes various faults and traps related to debugging 

NMI pin interrupt; signal on NMI pin 

Breakpoint; caused by INT 3 instruction, which is a 1-byte instruction useful for 
debugging 

INTO-detected overflow; occurs when the processor executes INTO with the OF 
flag set 


BOUND range exceeded; the BOUND instruction compares a register with 
boundaries stored in memory and generates an interrupt if the contents of the 
register is out of bounds. 


Undefined opcode 


Device not available; attempt to use ESC or WAIT instruction fails due to lack of 
external device 


Double fault; two interrupts occur during the same instruction and cannot be handled 
serially 


Reserved 


Invalid task state segment; segment describing a requested task is not initialized or 
not valid 


Segment not present; required segment not present 
Stack fault; limit of stack segment exceeded or stack segment not present 


General protection; protection violation that does not cause another exception 
(e.g., writing to a read-only segment) 


Page fault 
Reserved 
Floating-point error; generated by a floating-point arithmetic instruction 


Alignment check; access to a word stored at an odd byte address or a doubleword 
stored at an address not a multiple of 4 


Machine check; model specific 


Reserved 


User interrupt vectors; provided when INTR signal is activated 


Unshaded: exceptions 
Shaded: interrupts 


If more than one exception or interrupt is pending, the processor services 
them in a predictable order. The location of vector numbers within the table does 
not reflect priority. Instead, priority among exceptions and interrupts is organized 
into five classes. In descending order of priority, these are 

e Class 1: Traps on the previous instruction (vector number 1) 

e Class 2: External interrupts (2, 32-255) 

e Class 3: Faults from fetching next instruction (3, 14) 

e Class 4: Faults from decoding the next instruction (6, 7) 

e Class 5: Faults on executing an instruction (0, 4,5, 8, 10-14, 16, 17) 
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INTERRUPT HANDLING Just as with a transfer of execution using a CALL instruc- 
tion, a transfer to an interrupt-handling routine uses the system stack to store the 
processor state. When an interrupt occurs and is recognized by the processor, a 
sequence of events takes place: 


1. If the transfer involves a change of privilege level, then the current stack seg- 
ment register and the current extended stack pointer (ESP) register are pushed 
onto the stack. 


2. The current value of the EFLAGS register is pushed onto the stack. 


3. Both the interrupt (IF) and trap (TF) flags are cleared. This disables INTR in- 
terrupts and the trap or single-step feature. 


4. The current code segment (CS) pointer and the current instruction pointer (IP 
or EJP) are pushed onto the stack. 


5. If the interrupt is accompanied by an error code, then the error code is pushed 
onto the stack. 


6. The interrupt vector contents are fetched and loaded into the CS and IP or 
EIP registers. Execution continues from the interrupt service routine. 


To return from an interrupt, the interrupt service routine executes an IRET 
instruction. This causes all of the values saved on the stack to be restored; execution 
resumes from the point of the interrupt. 


12.6 THE ARM PROCESSOR 


In this section, we look at some of the key elements of the ARM architecture and 
organization. We defer a discussion of more complex aspects of organization and 
pipelining until Chapter 14. For the discussion in this section and in Chapter 14, it is 
useful to keep in mind key characteristics of the ARM architecture. ARM is primar- 
ily a RISC system with the following notable attributes: 


e A moderate array of uniform registers, more than are found on some CISC 
systems but fewer than are found on many RISC systems. 


e A load/store model of data processing, in which operations only perform on 
operands in registers and not directly in memory. All data must be loaded into 
registers before an operation can be performed; the result can then be used for 
further processing or stored into memory. 


e A uniform fixed-length instruction of 32 bits for the standard set and 16 bits 
for the Thumb instruction set. 


e To make each data processing instruction more flexible, either a shift or rota- 
tion can preprocess one of the source registers. To efficiently support this 
feature, there are separate arithmetic logic unit (ALU) and shifter units. 


e A small number of addressing modes with all load/store addressees deter- 
mined from registers and instruction fields. Indirect or indexed addressing 
involving values in memory are not used. 


e Auto-increment and auto-decrement addressing modes are used to improve 
the operation of program loops. 
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e Conditional execution of instructions minimizes the need for conditional branch 
instructions, thereby improving pipeline efficiency, because pipeline flushing 
is reduced. 


Processor Organization 


The ARM processor organization varies substantially from one implementation to 
the next, particularly when based on different versions of the ARM architecture. 
However, it is useful for the discussion in this section to present a simplified, generic 
ARM organization, which is illustrated in Figure 12.25. In this figure, the arrows indi- 
cate the flow of data. Each box represents a functional hardware unit or a storage unit. 

Data are exchanged with the processor from external memory through a data 
bus. The value transferred is either a data item, as a result of a load or store instruc- 
tion, or an instruction fetch. Fetched instructions pass through an instruction decoder 
before execution, under control of a control unit. The latter includes pipeline logic and 
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Memory address register Memory buffer register 
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R15 (PC) 
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Instruction 
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Figure 12.25 Simplified ARM Organization 
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provides control signals (not shown) to all the hardware elements of the processor. 
Data items are placed in the register file, consisting of a set of 32-bit registers. Byte or 
halfword items treated as twos-complement numbers are sign-extended to 32 bits. 

ARM data processing instructions typically have two source registers, Rn and 
Rm, and a single result or destination register, Rd. The source register values feed 
into the ALU or a separate multiply unit that makes use of an additional register to 
accumulate partial results. The ARM processor also includes a hardware unit that 
can shift or rotate the Rm value before it enters the ALU. This shift or rotate occurs 
within the cycle time of the instruction and increases the power and flexibility of 
many data processing operations. 

The results of an operation are fed back to the destination register. Load/store 
instructions may also use the output of the arithmetic units to generate the memory 
address for a load or store. 


Processor Modes 


It is quite common for a processor to support only a small number of processor 
modes. For example, many operating systems make use of just two modes: a user 
mode and a kernel mode, with the latter mode used to execute privileged system 
software. In contrast, the ARM architecture provides a flexible foundation for oper- 
ating systems to enforce a variety of protection policies. 

The ARM architecture supports seven execution modes. Most application pro- 
grams execute in user mode. While the processor is in user mode, the program being 
executed is unable to access protected system resources or to change mode, other 
than by causing an exception to occur. 

The remaining six execution modes are referred to as privileged modes. These 
modes are used to run system software. There are two principal advantages to defin- 
ing so many different privileged modes: (1) The OS can tailor the use of system soft- 
ware to a variety of circumstances, and (2) certain registers are dedicated for use for 
each of the privileged modes, allows swifter changes in context. 

The exception modes have full access to system resources and can change 
modes freely. Five of these modes are known as exception modes. These are entered 
when specific exceptions occur. Each of these modes has some dedicated registers 
that substitute for some of the user mode registers, and which are used to avoid cor- 
rupting User mode state information when the exception occurs. The exception 
modes are as follows: 


e Supervisor mode: Usually what the OS runs in. It is entered when the processor 
encounters a software interrupt instruction. Software interrupts are a standard 
way to invoke operating system services on ARM. 


e Abort mode: Entered in response to memory faults. 


e Undefined mode: Entered when the processor attempts to execute an instruc- 
tion that is supported neither by the main integer core nor by one of the 
coprocessors. 


e Fast interrupt mode: Entered whenever the processor receives an interrupt 
signal from the designated fast interrupt source. A fast interrupt cannot be in- 
terrupted, but a fast interrupt may interrupt a normal interrupt. 
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e Interrupt mode: Entered whenever the processor receives an interrupt signal 
from any other interrupt source (other than fast interrupt). An interrupt may 
only be interrupted by a fast interrupt. 


The remaining privileged mode is the System mode. This mode is not entered 
by any exception and uses the same registers available in User mode. The System 
mode is used for running certain privileged operating system tasks. System mode 
tasks may be interrupted by any of the five exception categories. 


Register Organization 


Figure 12.26 depicts the user-visible registers for the ARM. The ARM processor has 
a total of 37 32-bit registers, classified as follows: 


e Thirty-one registers referred to in the ARM manual as general-purpose regis- 
ters. In fact, some of these, such as the program counters, have special purposes. 


e Six program status registers. 


Registers are arranged in partially overlapping banks, with the current proces- 
sor mode determining which bank is available. At any time, sixteen numbered regis- 
ters and one or two program status registers are visible, for a total of 17 or 18 
software-visible registers. Figure 12.26 is interpreted as follows: 


e Registers RO through R7, register R15 (the program counter) and the current 
program status register (CPSR) are visible in and shared by all modes. 


e Registers R8 through R12 are shared by all modes except fast interrupt, which 
has its own dedicated registers R8_fiq through R12_fiq. 


e All the exception modes have their own versions of registers R13 and R14. 


e All the exception modes have a dedicated saved program status register (SPSR) 


GENERAL-PURPOSE REGISTERS Register R13 is normally used as a stack pointer 
and is also known as the SP. Because each exception mode has a separate R13, each 
exception mode can have its own dedicated program stack. R14 is known as the link 
register (LR) and is used to hold subroutine return addresses and exception mode 
returns. Register R15 is the program counter (PC). 


PROGRAM STATUS REGISTERS The CPSR is accessible in all processor modes. 
Each exception mode also has a dedicated SPSR that is used to preserve the value 
of the CPSR when the associated exception occurs. 

The 16 most significant bits of the CPSR contain user flags visible in User 
mode, and which can be used to affect the operation of a program (Figure 12.27). 
These are as follows: 


e Condition code flags: The N, Z, C and V flags, which are discussed in Chapter 10. 
e Q flag: used to indicate whether overflow and/or saturation has occurred in 
some SIMD-oriented instructions. 


e J bit: indicates the use of special 8-bit instructions, known as Jazelle instruc- 
tions, which are beyond the scope of our discussion. 
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Modes 


Privileged modes 


Exception modes 


Supervisor Abort Undefined Interrupt Fast interrupt 


R13(SP) R13(SP) R13_sve R13_abt R13_und R13_irq R13_fiq 
R14(LR) R14(LR) R14 sve R14_abt R14_und R14 _irq R14_fiq 
R1S(PC) | RI5(PC) R15(PC) R15(PC) R1S5(PC) R15(PC) R15(PC) 
| CPSR CPSR CPSR CPSR CPSR CPSR CPSR 
SPSR_sve SPSR_abt SPSR_und SPSR_irq SPSR_fiq 


Shading indicates that the normal register used by User or System mode has been replaced by an alternative 
register specific to the exception mode. 


SP = stack pointer CPSR = current program status register 
LR = link register SPSR = saved program status register 
PC = program counter 


Figure 12.26 ARM Register Organization 


¢ GE[3:0] bits: SIMD instructions use bits[19:16] as Greater than or Equal (GE) 
flags for individual bytes or halfwords of the result. 


The 16 least significant bits of the CPSR contain system control flags that can 
only be altered when the processor is in a privileged mode. The fields are as follows: 
e E bit: Controls load and store endianness for data; ignored for instruction fetches. 


e Interrupt disable bits: The A bit disables imprecise data aborts when set; the I bit 
disables IRQ interrupts when set; and the F bit disables FIQ interrupts when set. 


¢ T bit: Indicates whether instructions should be interpreted as normal ARM in- 
structions or Thumb instructions. 


e Mode bits: Indicates the processor mode 
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N/|Z|C}V|Q] Res | J} Reserved | GE[3:0] Reserved E|A| I] F/T M[4:0] 
XN rN 7 
Ma yr 

User flags System control flags 


Figure 12.27 Format of ARM CPSR and SPSR 


Interrupt Processing 


As with any processor, the ARM includes a facility that enables the processor to in- 
terrupt the currently executing program to deal with exception conditions. Exceptions 
are generated by internal and external sources to cause the processor to handle an 
event. The processor state just before handling the exception is normally preserved 
so that the original program can be resumed when the exception routine has com- 
pleted. More than one exception can arise at the same time. The ARM architecture 
supports seven types of exception. Table 12.4 lists the types of exception and the 


Table 12.4 ARM Interrupt Vector 


Normal entry 
Exception type address Description 


Reset Supervisor 0x00000000 Occurs when the system is initialized. 


Data abort Abort 0x00000010 Occurs when an invalid memory address 
has been accessed, such as if there is no 
physical memory for an address or the 
correct access permission is lacking. 


FIQ (fast interrupt) 0x0000001C Occurs when an external device asserts 
the FIQ pin on the processor. An inter- 
rupt cannot be interrupted except by an 
FIQ. FIQ is designed to support a data 
transfer or channel process, and has suffi- 
cient private registers to remove the need 
for register saving in such applications, 
therefore minimizing the overhead of 
context switching. A fast interrupt cannot 
be interrupted. 


IRQ (interrupt) 0x00000018 Occurs when an external device asserts 
the IRQ pin on the processor. An interrupt 
cannot be interrupted except by an FIO. 


Prefetch abort 0x0000000C Occurs when an attempt to fetch an in- 
struction results in a memory fault. The 
exception is raised when the instruction 
enters the execute stage of the pipeline. 


Undefined instructions Undefined 0x00000004 Occurs when an instruction not in the in- 
struction set reaches the execute stage of 
the pipeline. 


Software interrupt Supervisor 0x00000008 Generally used to allow user mode pro- 
grams to call the OS. The user program 
executes a SWI instruction with an argu- 
ment that identifies the function the user 
wishes to perform. 
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processor mode that is used to process each type. When an exception occurs, execu- 
tion is forced from a fixed memory address corresponding to the type of exception. 
These fixed addresses are called the exception vectors. 

If more than one interrupt is outstanding, they are handled in priority order. 
Table 12.4 lists the exceptions in priority order, highest to lowest. 

When an exception occurs, the processor halts execution after the current instruc- 
tion. The state of the processor is preserved in the SPSR that corresponds to the type of 
exception, so that the original program can be resumed when the exception routine has 
completed. The address of the instruction the processor was just about to execute is 
placed in the link register of the appropriate processor mode. To return after handling 
the exception, the SPSR is moved into the CPSR and R14 is moved into the PC. 
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12.8 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


branch prediction flag instruction prefetch 


condition code 
delayed branch instruction pipeline 


instruction cycle program status word (PSW) 


Review Questions 


12.1 What general roles are performed by processor registers? 
12.2 What categories of data are commonly supported by user-visible registers? 
12.3. What is the function of condition codes? 
12.4 What is a program status word? 
12.5 Why is a two-stage instruction pipeline unlikely to cut the instruction cycle time in 
half, compared with the use of no pipeline? 
12.6 List and briefly explain various ways in which an instruction pipeline can deal with 
conditional branch instructions. 
12.7 How are history bits used for branch prediction? 
Problems 
12.1 a. Ifthe last operation performed on a computer with an 8-bit word was an addition 
in which the two operands were 00000010 and 00000011, what would be the value 
of the following flags? 
e Carry 
° Zero 
e Overflow 
e Sign 
e Even Parity 
e Half-Carry 
b. Repeat for the addition of —1 (twos complement) and +1. 
12.2 Repeat Problem 12.1 for the operation A — B, where A contains 11110000 and B 
contains 0010100. 
12.3 A microprocessor is clocked at a rate of 5 GHz. 
a. How long is a clock cycle? 
b. What is the duration of a particular type of machine instruction consisting of three 
clock cycles? 
12.4 A microprocessor provides an instruction capable of moving a string of bytes from 


one area of memory to another. The fetching and initial decoding of the instruction 


12.6 


12.8 


12.9 


12.10 


12.11 


12.12 


12.13 
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takes 10 clock cycles. Thereafter, it takes 15 clock cycles to transfer each byte. The mi- 

croprocessor is clocked at a rate of 10 GHz. 

a. Determine the length of the instruction cycle for the case of a string of 64 bytes. 

b. What is the worst-case delay for acknowledging an interrupt if the instruction is 
noninterruptible? 

c. Repeat part (b) assuming the instruction can be interrupted at the beginning of 
each byte transfer. 


The Intel 8088 consists of a bus interface unit (BIU) and an execution unit (EU), 
which form a 2-stage pipeline. The BIU fetches instructions into a 4-byte instruction 
queue. The BIU also participates in address calculations, fetches operands, and writes 
results in memory as requested by the EU. If no such requests are outstanding and the 
bus is free, the BIU fills any vacancies in the instruction queue. When the EU com- 
pletes execution of an instruction, it passes any results to the BIU (destined for mem- 
ory or I/O) and proceeds to the next instruction. 

a. Suppose the tasks performed by the BIU and EU take about equal time. By what 
factor does pipelining improve the performance of the 8088? Ignore the effect of 
branch instructions. 

b. Repeat the calculation assuming that the EU takes twice as long as the BIU. 


Assume an 8088 is executing a program in which the probability of a program jump is 
0.1. For simplicity, assume that all instructions are 2 bytes long. 

a. What fraction of instruction fetch bus cycles is wasted? 

b. Repeat if the instruction queue is 8 bytes long. 


Consider the timing diagram of Figures 12.10. Assume that there is only a two-stage 
pipeline (fetch, execute). Redraw the diagram to show how many time units are now 
needed for four instructions. 


Assume a pipeline with four stages: fetch instruction (FI), decode instruction and cal- 
culate addresses (DA), fetch operand (FO), and execute (EX). Draw a diagram simi- 
lar to Figures 12.10 for a sequence of 7 instructions, in which the third instruction is a 
branch that is taken and in which there are no data dependencies. 


A pipelined processor has a clock rate of 2.5 GHz and executes a program with 

1.5 million instructions. The pipeline has five stages, and instructions are issued at 

a rate of one per clock cycle. Ignore penalties due to branch instructions and out- 

of-sequence executions. 

a. What is the speedup of this processor for this program compared to a non- 
pipelined processor, making the same assumptions used in Section 12.4? 

b. What is throughput (in MIPS) of the pipelined processor? 

A nonpipelined processor has a clock rate of 2.5 GHz and an average CPI (cycles per 

instruction) of 4. An upgrade to the processor introduces a five-stage pipeline. How- 

ever, due to internal pipeline delays, such as latch delay, the clock rate of the new 

processor has to be reduced to 2 GHz. 

a. What is the speedup achieved for a typical program? 

b. What is the MIPS rate for each processor? 

Consider an instruction sequence of length n that is streaming through the instruction 

pipeline. Let p be the probability of encountering a conditional or unconditional branch 

instruction, and let g be the probability that execution of a branch instruction I causes a 

jump to a nonconsecutive address. Assume that each such jump requires the pipeline to 

be cleared, destroying all ongoing instruction processing, when I emerges from the last 

stage. Revise Equations (12.1) and (12.2) to take these probabilities into account. 

One limitation of the multiple-stream approach to dealing with branches in a pipeline 

is that additional branches will be encountered before the first branch is resolved. 

Suggest two additional limitations or drawbacks. 

Consider the state diagrams of Figure 12.28. 

a. Describe the behavior of each. 

b. Compare these with the branch prediction state diagram in Section 12.4. Discuss 
the relative merits of each of the three approaches to branch prediction. 
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Figure 12.28 Two Branch Prediction State Diagrams 


12.14 The Motorola 680x0 machines include the instruction Decrement and Branch Ac- 


cording to Condition, which has the following form: 
DBcc Dn, displacement 


where cc is one of the testable conditions, Dn is a general-purpose register, and dis- 
placement specifies the target address relative to the current address. The instruction 
can be defined as follows: 


if (cc = False) 
then begin 

Dn := (Dn) — 1; 

if Dn # —1 then PC := (PC) + displacement end 
else PC := (PC) + 2; 


When the instruction is executed, the condition is first tested to determine whether 
the termination condition for the loop is satisfied. If so, no operation is performed and 
execution continues at the next instruction in sequence. If the condition is false, the 
specified data register is decremented and checked to see if it is less than zero. If it is 
less than zero, the loop is terminated and execution continues at the next instruction 
in sequence. Otherwise, the program branches to the specified location. Now consider 
the following assembly-language program fragment: 


AGAIN CMPM.L  (A0)+,(A1)+ 
DBNE D1, AGAIN 
NOP 


Two strings addressed by AO and Al are compared for equality; the string pointers 

are incremented with each reference. D1 initially contains the number of longwords 

(4 bytes) to be compared. 

a. The initial contents of the registers are AO = $00004000, A1 = $00005000, and 
D1 = $000000FF (the $ indicates hexadecimal notation). Memory between $4000 
and $6000 is loaded with words $AAAA. If the foregoing program is run, specify 
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Table 12.5 Branch Behavior in Sample Applications 


Occurrence of branch classes: 

Type 1: Branch 72.5% 
Type 2: Loop control 9.8% 
Type 3: Procedure call,return 17.7% 


Type 1 branch: where it goes Scientific Commercial Systems 
Unconditional—100% go to target 20% 40% 35% 
Conditional—went to target 43.2% 24.3% 32.5% 
Conditional—did not go to target (inline) 36.8% 35.7% 32.5% 


Type 2 branch (all environments) 


That go to target 91% 


That go inline 9% 
Type 3 branch 
100% go to target 


the number of times the DBNE loop is executed and the contents of the three reg- 
isters when the NOP instruction is reached. 

b. Repeat (a), but now assume that memory between $4000 and $4FEE is loaded 
with $0000 and between $5000 and $6000 is loaded with $A AA. 

12.15 Redraw Figures 12.19c, assuming that the conditional branch is not taken. 

12.16 Table 12.5 summarizes statistics from [MACD84] concerning branch behavior for 
various classes of applications. With the exception of type 1 branch behavior, there is 
no noticeable difference among the application classes. Determine the fraction of all 
branches that go to the branch target address for the scientific environment. Repeat 
for commercial and systems environments. 

12.17 Pipelining can be applied within the ALU to speed up floating-point operations. Con- 
sider the case of floating-point addition and subtraction. In simplified terms, the 
pipeline could have four stages: (1) Compare the exponents; (2) Choose the exponent 
and align the significands; (3) Add or subtract significands; (4) Normalize the results. 
The pipeline can be considered to have two parallel threads, one handling exponents 
and one handling significands, and could start out like this: 


Exponents Significands 


a b A B 


In this figure, the boxes labeled R refer to a set of registers used to hold temporary re- 
sults. Complete the block diagram that shows at a top level the structure of the pipeline. 
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KEY POINTS 


Studies of the execution behavior of high-level language programs have 
provided guidance in designing a new type of processor architecture: the 
reduced instruction set computer (RISC). Assignment statements predom- 
inate, suggesting that the simple movement of data should be optimized. 
There are also many IF and LOOP instructions, which suggest that the 
underlying sequence control mechanism needs to be optimized to permit 
efficient pipelining. Studies of operand reference patterns suggest that it 
should be possible to enhance performance by keeping a moderate number 
of operands in registers. 


These studies have motivated the key characteristics of RISC machines: 
(1) a limited instruction set with a fixed format, (2) a large number of regis- 
ters or the use of a compiler that optimizes register usage, and (3) an empha- 
sis on optimizing the instruction pipeline. 

The simple instruction set of a RISC lends itself to efficient pipelining 
because there are fewer and more predictable operations performed per 
instruction. A RISC instruction set architecture also lends itself to the 
delayed branch technique, in which branch instructions are rearranged with 
other instructions to improve pipeline efficiency. 


Since the development of the stored-program computer around 1950, there have been 
remarkably few true innovations in the areas of computer organization and architec- 
ture. The following are some of the major advances since the birth of the computer: 


The family concept: Introduced by IBM with its System/360 in 1964, followed 
shortly thereafter by DEC, with its PDP-8. The family concept decouples the 
architecture of a machine from its implementation. A set of computers is of- 
fered, with different price/performance characteristics, that presents the same 
architecture to the user. The differences in price and performance are due to 
different implementations of the same architecture. 


Microprogrammed control unit: Suggested by Wilkes in 1951 and introduced 
by IBM on the S/360 line in 1964. Microprogramming eases the task of design- 
ing and implementing the control unit and provides support for the family 
concept. 


Cache memory: First introduced commercially on IBM S/360 Model 85 in 
1968. The insertion of this element into the memory hierarchy dramatically 
improves performance. 


Pipelining: A means of introducing parallelism into the essentially sequential 
nature of a machine-instruction program. Examples are instruction pipelining 
and vector processing. 


Multiple processors: This category covers a number of different organizations 
and objectives. 
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e Reduced instruction set computer (RISC) architecture: This is the focus of this 
chapter. 


The RISC architecture is a dramatic departure from the historical trend in 
processor architecture. An analysis of the RISC architecture brings into focus many of 
the important issues in computer organization and architecture. 

Although RISC systems have been defined and designed in a variety of ways by 
different groups, the key elements shared by most designs are these: 


e A large number of general-purpose registers, and/or the use of compiler tech- 
nology to optimize register usage 
e A limited and simple instruction set 
e An emphasis on optimizing the instruction pipeline 
Table 13.1 compares several RISC and non-RISC systems. 
We begin this chapter with a brief survey of some results on instruction sets, and 


then examine each of the three topics just listed. This is followed by a description of two 
of the best-documented RISC designs. 


13.1 INSTRUCTION EXECUTION CHARACTERISTICS 


One of the most visible forms of evolution associated with computers is that of 
programming languages. As the cost of hardware has dropped, the relative cost of 
software has risen. Along with that, a chronic shortage of programmers has driven 
up software costs in absolute terms. Thus, the major cost in the life cycle of a system 
is software, not hardware. Adding to the cost, and to the inconvenience, is the ele- 
ment of unreliability: it is common for programs, both system and application, to 
continue to exhibit new bugs after years of operation. 

The response from researchers and industry has been to develop ever more 
powerful and complex high-level programming languages. These high-level lan- 
guages (HLLs) allow the programmer to express algorithms more concisely, take 
care of much of the detail, and often support naturally the use of structured pro- 
gramming or object-oriented design. 

Alas, this solution gave rise to another problem, known as the semantic gap, 
the difference between the operations provided in HLLs and those provided in 
computer architecture. Symptoms of this gap are alleged to include execution ineffi- 
ciency, excessive machine program size, and compiler complexity. Designers 
responded with architectures intended to close this gap. Key features include large 
instruction sets, dozens of addressing modes, and various HLL statements imple- 
mented in hardware. An example of the latter is the CASE machine instruction on 
the VAX. Such complex instruction sets are intended to 


e Ease the task of the compiler writer. 


e Improve execution efficiency, because complex sequences of operations can 
be implemented in microcode. 


e Provide support for even more complex and sophisticated HLLs. 


€8V 


Table 13.1 Characteristics of Some CISCs, RISCs, and Superscalar Processors 


Complex Instruction Set 
(CISC) Computer 


Reduced Instruction 
Set (RISC) Computer 


Superscalar 


Characteristic 


IBM 
370/168 


VAX 
11/780 


SPARC 


MIPS 
R4000 


PowerPC 


Ultra 
SPARC 


MIPS 
R10000 


Year developed 


1973 


1978 


1987 


1991 


1993 


1996 


1996 


Number of instructions 


303 


94 


R25 


Instruction size (bytes) 


2-57 


4 


4 


Addressing modes 


2D 


1 


2; 


Number of general- 
purpose registers 


16 


32 


ay 


Control memory 
size (Kbits) 


Cache size (KBytes) 
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Meanwhile, a number of studies have been done over the years to determine 
the characteristics and patterns of execution of machine instructions generated 
from HLL programs. The results of these studies inspired some researchers to look 
for a different approach: namely, to make the architecture that supports the HLL 
simpler, rather than more complex. 

To understand the line of reasoning of the RISC advocates, we begin with a 
brief review of instruction execution characteristics. The aspects of computation of 
interest are as follows: 


¢ Operations performed: These determine the functions to be performed by the 
processor and its interaction with memory. 


e Operands used: The types of operands and the frequency of their use deter- 
mine the memory organization for storing them and the addressing modes for 
accessing them. 


e Execution sequencing: This determines the control and pipeline organization. 


In the remainder of this section, we summarize the results of a number of 
studies of high-level-language programs. All of the results are based on dynamic 
measurements. That is, measurements are collected by executing the program and 
counting the number of times some feature has appeared or a particular property 
has held true. In contrast, static measurements merely perform these counts on 
the source text of a program. They give no useful information on performance, 
because they are not weighted relative to the number of times each statement is 
executed. 


Operations 


A variety of studies have been made to analyze the behavior of HLL programs. 
Table 4.8, discussed in Chapter 4, includes key results from a number of studies. 
There is quite good agreement in the results of this mixture of languages and appli- 
cations. Assignment statements predominate, suggesting that the simple movement 
of data is of high importance. There is also a preponderance of conditional state- 
ments (IF, LOOP). These statements are implemented in machine language with 
some sort of compare and branch instruction. This suggests that the sequence con- 
trol mechanism of the instruction set is important. 

These results are instructive to the machine instruction set designer, indicating 
which types of statements occur most often and therefore should be supported in an 
“optimal” fashion. However, these results do not reveal which statements use the 
most time in the execution of a typical program. That is, given a compiled machine- 
language program, which statements in the source language cause the execution of 
the most machine-language instructions? 

To get at this underlying phenomenon, the Patterson programs [PATTS82a], 
described in Appendix 4A, were compiled on the VAX, PDP-11, and Motorola 
68000 to determine the average number of machine instructions and memory ref- 
erences per statement type. The second and third columns in Table 13.2 show the 
relative frequency of occurrence of various HLL instructions in a variety of pro- 
grams; the data were obtained by observing the occurrences in running programs 
rather than just the number of times that statements occur in the source code. 
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Table 13.2 Weighted Relative Dynamic Frequency of HLL Operations [PATT82a] 


Machine-Instruction Memory-Reference 
Dynamic Occurrence Weighted Weighted 


Pascal Pascal Pascal 


ASSIGN 
LOOP 
CALL 


IF 
GOTO 
OTHER 


Hence these are dynamic frequency statistics. To obtain the data in columns four 
and five (machine-instruction weighted), each value in the second and third 
columns is multiplied by the number of machine instructions produced by the com- 
piler. These results are then normalized so that columns four and five show the rel- 
ative frequency of occurrence, weighted by the number of machine instructions per 
HLL statement. Similarly, the sixth and seventh columns are obtained by multiply- 
ing the frequency of occurrence of each statement type by the relative number of 
memory references caused by each statement. The data in columns four through 
seven provide surrogate measures of the actual time spent executing the various 
statement types. The results suggest that the procedure call/return is the most time- 
consuming operation in typical HLL programs. 

The reader should be clear on the significance of Table 13.2. This table indi- 
cates the relative significance of various statement types in an HLL, when that HLL 
is compiled for a typical contemporary instruction set architecture. Some other ar- 
chitecture could conceivably produce different results. However, this study pro- 
duces results that are representative for contemporary complex instruction set 
computer (CISC) architectures. Thus, they can provide guidance to those looking 
for more efficient ways to support HLLs. 


Operands 


Much less work has been done on the occurrence of types of operands, despite the 
importance of this topic. There are several aspects that are significant. 

The Patterson study already referenced [PATT82a] also looked at the dynamic 
frequency of occurrence of classes of variables (Table 13.3). The results, consistent 
between Pascal and C programs, show that the majority of references are to simple 


Table 13.3 Dynamic Percentage of Operands 


Pascal Average 


Integer Constant 16% 20% 
Scalar Variable 58% 55% 
Array/Structure 26% 25% 
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scalar variables. Further, more than 80% of the scalars were local (to the procedure) 
variables. In addition, references to arrays/structures require a previous reference to 
their index or pointer, which again is usually a local scalar. Thus, there is a prepon- 
derance of references to scalars, and these are highly localized. 

The Patterson study examined the dynamic behavior of HLL programs, inde- 
pendent of the underlying architecture. As discussed before, it is necessary to 
deal with actual architectures to examine program behavior more deeply. One 
study, [LUND77], examined DEC-10 instructions dynamically and found that each 
instruction on the average references 0.5 operand in memory and 1.4 registers. Sim- 
ilar results are reported in [HUCK83] for C, Pascal, and FORTRAN programs on 
S/370, PDP-11, and VAX. Of course, these figures depend highly on both the architec- 
ture and the compiler, but they do illustrate the frequency of operand accessing. 

These latter studies suggest the importance of an architecture that lends itself 
to fast operand accessing, because this operation is performed so frequently. The 
Patterson study suggests that a prime candidate for optimization is the mechanism 
for storing and accessing local scalar variables. 


Procedure Calls 


We have seen that procedure calls and returns are an important aspect of HLL pro- 
grams. The evidence (Table 13.2) suggests that these are the most time-consuming 
operations in compiled HLL programs. Thus, it will be profitable to consider ways of 
implementing these operations efficiently. Two aspects are significant: the number 
of parameters and variables that a procedure deals with, and the depth of nesting. 

Tanenbaum’s study [TANE78] found that 98% of dynamically called proce- 
dures were passed fewer than six arguments and that 92% of them used fewer than 
six local scalar variables. Similar results were reported by the Berkeley RISC team 
[KATE83], as shown in Table 13.4. These results show that the number of words re- 
quired per procedure activation is not large. The studies reported earlier indicated 
that a high proportion of operand references is to local scalar variables. These stud- 
ies show that those references are in fact confined to relatively few variables. 

The same Berkeley group also looked at the pattern of procedure calls and 
returns in HLL programs. They found that it is rare to have a long uninterrupted 
sequence of procedure calls followed by the corresponding sequence of returns. 


Table 13.4 Procedure Arguments and Local Scalar Variables 


Percentage of Executed Compiler, Interpreter, Small Nonnumeric 
Procedure Calls With and Typesetter Programs 


>3 arguments 0-7% 0-5% 
>5 arguments 0-3% 0% 
>8 words of arguments and 1-20% 0-6% 
local scalars 
>12 words of arguments and 16% 0-3% 
local scalars 
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Rather, they found that a program remains confined to a rather narrow window of 
procedure-invocation depth. This is illustrated in Figure 4.21, which was discussed in 
Chapter 4. These results reinforce the conclusion that operand references are highly 
localized. 


Implications 


A number of groups have looked at results such as those just reported and have 
concluded that the attempt to make the instruction set architecture close to HLLs 
is not the most effective design strategy. Rather, the HLLs can best be supported 
by optimizing performance of the most time-consuming features of typical HLL 
programs. 

Generalizing from the work of a number of researchers, three elements emerge 
that, by and large, characterize RISC architectures. First, use a large number of reg- 
isters or use a compiler to optimize register usage. This is intended to optimize 
operand referencing. The studies just discussed show that there are several refer- 
ences per HLL instruction and that there is a high proportion of move (assignment) 
statements. This, coupled with the locality and predominance of scalar references, 
suggests that performance can be improved by reducing memory references at the 
expense of more register references. Because of the locality of these references, an 
expanded register set seems practical. 

Second, careful attention needs to be paid to the design of instruction pipelines. 
Because of the high proportion of conditional branch and procedure call instruc- 
tions, a straightforward instruction pipeline will be inefficient. This manifests itself 
as a high proportion of instructions that are prefetched but never executed. 

Finally, a simplified (reduced) instruction set is indicated. This point is not as 
obvious as the others, but should become clearer in the ensuing discussion. 


13.2 THE USE OF A LARGE REGISTER FILE 


The results summarized in Section 13.1 point out the desirability of quick access to 
operands. We have seen that there is a large proportion of assignment statements in 
HLL programs, and many of these are of the simple form A <— B. Also, there is a 
significant number of operand accesses per HLL statement. If we couple these re- 
sults with the fact that most accesses are to local scalars, heavy reliance on register 
storage is suggested. 

The reason that register storage is indicated is that it is the fastest available 
storage device, faster than both main memory and cache. The register file is physi- 
cally small, on the same chip as the ALU and control unit, and employs much shorter 
addresses than addresses for cache and memory. Thus, a strategy is needed that will 
allow the most frequently accessed operands to be kept in registers and to minimize 
register-memory operations. 

Two basic approaches are possible, one based on software and the other on 
hardware. The software approach is to rely on the compiler to maximize register 
usage. The compiler will attempt to allocate registers to those variables that will be 
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used the most in a given time period. This approach requires the use of sophisticated 
program-analysis algorithms. The hardware approach is simply to use more registers 
so that more variables can be held in registers for longer periods of time. 

In this section, we will discuss the hardware approach. This approach has been 
pioneered by the Berkeley RISC group [PATT82a]; was used in the first commercial 
RISC product, the Pyramid [RAGA83]; and is currently used in the popular SPARC 
architecture. 


Register Windows 


On the face of it, the use of a large set of registers should decrease the need to access 
memory. The design task is to organize the registers in such a fashion that this goal is 
realized. 

Because most operand references are to local scalars, the obvious approach is 
to store these in registers, with perhaps a few registers reserved for global variables. 
The problem is that the definition of /ocal changes with each procedure call and re- 
turn, operations that occur frequently. On every call, local variables must be saved 
from the registers into memory, so that the registers can be reused by the called 
program. Furthermore, parameters must be passed. On return, the variables of the 
parent program must be restored (loaded back into registers) and results must be 
passed back to the parent program. 

The solution is based on two other results reported in Section 13.1. First, a typ- 
ical procedure employs only a few passed parameters and local variables (Table 
13.4). Second, the depth of procedure activation fluctuates within a relatively nar- 
row range (Figure 4.21). To exploit these properties, multiple small sets of registers 
are used, each assigned to a different procedure. A procedure call automatically 
switches the processor to use a different fixed-size window of registers, rather than 
saving registers in memory. Windows for adjacent procedures are overlapped to 
allow parameter passing. 

The concept is illustrated in Figure 13.1. At any time, only one window of regis- 
ters is visible and is addressable as if it were the only set of registers (e.g., addresses 0 
through N — 1). The window is divided into three fixed-size areas. Parameter registers 
hold parameters passed down from the procedure that called the current procedure 
and hold results to be passed back up. Local registers are used for local variables, as as- 
signed by the compiler. Temporary registers are used to exchange parameters and re- 
sults with the next lower level (procedure called by current procedure). The temporary 
registers at one level are physically the same as the parameter registers at the next 


Parameter Local Temporary Level J 
registers registers registers 
Call/return 
Parameter Local Temporary 
registers registers registers Level J + 1 


Figure 13.1 Overlapping Register Windows 
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lower level. This overlap permits parameters to be passed without the actual movement 
of data. Keep in mind that, except for the overlap, the registers at two different levels 
are physically distinct. That is, the parameter and local registers at level J are disjoint 
from the local and temporary registers at level J + 1. 

To handle any possible pattern of calls and returns, the number of register win- 
dows would have to be unbounded. Instead, the register windows can be used to hold 
the few most recent procedure activations. Older activations must be saved in memory 
and later restored when the nesting depth decreases. Thus, the actual organization of 
the register file is as a circular buffer of overlapping windows. Two notable examples 
of this approach are Sun’s SPARC architecture, described in Section 13.7, and the 
IA-64 architecture used in Intel’s Itanium processor, described in Chapter 21. 

The circular organization is shown in Figure 13.2, which depicts a circular 
buffer of six windows. The buffer is filled to a depth of 4 (A called B; B called C; C 
called D) with procedure D active. The current-window pointer (CWP) points to the 
window of the currently active procedure. Register references by a machine instruc- 
tion are offset by this pointer to determine the actual physical register. The saved- 
window pointer (SWP) identifies the window most recently saved in memory. If 
procedure D now calls procedure E, arguments for E are placed in D’s temporary 
registers (the overlap between w3 and w4) and the CWP is advanced by one window. 


Restore gaye 


Current 


window 
pointer 
__ ce 


Figure 13.2. Circular-Buffer Organization of Overlapped Windows 
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If procedure E then makes a call to procedure F, the call cannot be made with 
the current status of the buffer. This is because F’s window overlaps A’s window. If F 
begins to load its temporary registers, preparatory to a call, it will overwrite the pa- 
rameter registers of A (A.in). Thus, when CWP is incremented (modulo 6) so that it 
becomes equal to SWP, an interrupt occurs, and A’s window is saved. Only the first 
two portions (A.in and A.loc) need be saved. Then, the SWP is incremented and the 
call to F proceeds. A similar interrupt can occur on returns. For example, subsequent 
to the activation of F, when B returns to A, CWP is decremented and becomes equal 
to SWP. This causes an interrupt that results in the restoration of A’s window. 

From the preceding, it can be seen that an N-window register file can hold 
only N — 1 procedure activations. The value of N need not be large. As was men- 
tioned in Appendix 4A, one study [TAMI83] found that, with 8 windows, a save or 
restore is needed on only 1% of the calls or returns. The Berkeley RISC computers 
use 8 windows of 16 registers each. The Pyramid computer employs 16 windows of 
32 registers each. 


Global Variables 


The window scheme just described provides an efficient organization for storing 
local scalar variables in registers. However, this scheme does not address the need to 
store global variables, those accessed by more than one procedure. Two options sug- 
gest themselves. First, variables declared as global in an HLL can be assigned mem- 
ory locations by the compiler, and all machine instructions that reference these 
variables will use memory-reference operands. This is straightforward, from both 
the hardware and software (compiler) points of view. However, for frequently ac- 
cessed global variables, this scheme is inefficient. 

An alternative is to incorporate a set of global registers in the processor. These 
registers would be fixed in number and available to all procedures. A unified num- 
bering scheme can be used to simplify the instruction format. For example, refer- 
ences to registers 0 through 7 could refer to unique global registers, and references 
to registers 8 through 31 could be offset to refer to physical registers in the current 
window. There is an increased hardware burden to accommodate the split in register 
addressing. In addition, the compiler must decide which global variables should be 
assigned to registers. 


Large Register File versus Cache 


The register file, organized into windows, acts as a small, fast buffer for holding a 
subset of all variables that are likely to be used the most heavily. From this point of 
view, the register file acts much like a cache memory, although a much faster memory. 
The question therefore arises as to whether it would be simpler and better to use a 
cache and a small traditional register file. 

Table 13.5 compares characteristics of the two approaches. The window-based 
register file holds all the local scalar variables (except in the rare case of window over- 
flow) of the most recent N — 1 procedure activations. The cache holds a selection of 
recently used scalar variables. The register file should save time, because all local 
scalar variables are retained. On the other hand, the cache may make more efficient 
use of space, because it is reacting to the situation dynamically. Furthermore, caches 
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Table 13.5 Characteristics of Large-Register-File and Cache Organizations 


Large Register File Cache 


All local scalars Recently-used local scalars 
Individual variables Blocks of memory 
Compiler-assigned global variables Recently-used global variables 


Save/Restore based on procedure Save/Restore based on cache 
nesting depth replacement algorithm 


Register addressing Memory addressing 


generally treat all memory references alike, including instructions and other types of 
data. Thus, savings in these other areas are possible with a cache and not a register file. 

A register file may make inefficient use of space, because not all procedures 
will need the full window space allotted to them. On the other hand, the cache suf- 
fers from another sort of inefficiency: Data are read into the cache in blocks. Whereas 
the register file contains only those variables in use, the cache reads in a block of 
data, some or much of which will not be used. 

The cache is capable of handling global as well as local variables. There are 
usually many global scalars, but only a few of them are heavily used [KATE83]. A 
cache will dynamically discover these variables and hold them. If the window-based 
register file is supplemented with global registers, it too can hold some global 
scalars. However, it is difficult for a compiler to determine which globals will be 
heavily used. 

With the register file, the movement of data between registers and memory is 
determined by the procedure nesting depth. Because this depth usually fluctuates 
within a narrow range, the use of memory is relatively infrequent. Most cache mem- 
ories are set associative with a small set size. Thus, there is the danger that other data 
or instructions will overwrite frequently used variables. 

Based on the discussion so far, the choice between a large window-based reg- 
ister file and a cache is not clear-cut. There is one characteristic, however, in which 
the register approach is clearly superior and which suggests that a cache-based sys- 
tem will be noticeably slower. This distinction shows up in the amount of addressing 
overhead experienced by the two approaches. 

Figure 13.3 illustrates the difference. To reference a local scalar in a window- 
based register file, a “virtual” register number and a window number are used. These 
can pass through a relatively simple decoder to select one of the physical registers. 
To reference a memory location in cache, a full-width memory address must be gen- 
erated. The complexity of this operation depends on the addressing mode. In a set 
associative cache, a portion of the address is used to read a number of words and 
tags equal to the set size. Another portion of the address is compared with the tags, 
and one of the words that were read is selected. It should be clear that even if the 
cache is as fast as the register file, the access time will be considerably longer. Thus, 
from the point of view of performance, the window-based register file is superior for 
local scalars. Further performance improvement could be achieved by the addition 
of a cache for instructions only. 
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Figure 13.3. Referencing a Scalar 


13.3 COMPILER-BASED REGISTER OPTIMIZATION 


Let us assume now that only a small number (e.g., 16-32) of registers is available on 
the target RISC machine. In this case, optimized register usage is the responsibility 
of the compiler. A program written in a high-level language has, of course, no ex- 
plicit references to registers. Rather, program quantities are referred to symbolically. 
The objective of the compiler is to keep the operands for as many computations as 
possible in registers rather than main memory, and to minimize load-and-store 
operations. 

In general, the approach taken is as follows. Each program quantity that is a 
candidate for residing in a register is assigned to a symbolic or virtual register. The 
compiler then maps the unlimited number of symbolic registers into a fixed number 
of real registers. Symbolic registers whose usage does not overlap can share the 
same real register. If, in a particular portion of the program, there are more quanti- 
ties to deal with than real registers, then some of the quantities are assigned to mem- 
ory locations. Load-and-store instructions are used to position quantities in registers 
temporarily for computational operations. 
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The essence of the optimization task is to decide which quantities are to be as- 
signed to registers at any given point in the program. The technique most commonly 
used in RISC compilers is known as graph coloring, which is a technique borrowed 
from the discipline of topology [CHAI82, CHOW86, COUT86, CHOW90]. 

The graph coloring problem is this. Given a graph consisting of nodes and 
edges, assign colors to nodes such that adjacent nodes have different colors, and do 
this in such a way as to minimize the number of different colors. This problem is 
adapted to the compiler problem in the following way. First, the program is analyzed 
to build a register interference graph. The nodes of the graph are the symbolic regis- 
ters. If two symbolic registers are “live” during the same program fragment, then they 
are joined by an edge to depict interference. An attempt is then made to color the 
graph with n colors, where n is the number of registers. Nodes that share the same 
color can be assigned to the same register. If this process does not fully succeed, then 
those nodes that cannot be colored must be placed in memory, and loads and stores 
must be used to make space for the affected quantities when they are needed. 

Figure 13.4 is a simple example of the process. Assume a program with six 
symbolic registers to be compiled into three actual registers. Figure 13.4a shows the 
time sequence of active use of each symbolic register. The dashed horizontal lines 
indicate successive instruction executions. Figure 13.4b shows the register interfer- 
ence graph (shading and cross-hatching are used instead of colors). A possible col- 
oring with three colors is indicated. Because symbolic registers A and D do not 
interfere, the compile can assign both of these to physical register R1. Similarly, sym- 
bolic registers C and E can be assigned to register R3. One symbolic register, F, is 
left uncolored and must be dealt with using loads and stores. 

In general, there is a trade-off between the use of a large set of registers and 
compiler-based register optimization. For example, [BRAD91a] reports on a study 
that modeled a RISC architecture with features similar to the Motorola 88000 and 
the MIPS R2000. The researchers varied the number of registers from 16 to 128, and 
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Figure 13.4 Graph Coloring Approach 
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they considered both the use of all general-purpose registers and registers split 
between integer and floating-point use. Their study showed that with even simple 
register optimization, there is little benefit to the use of more than 64 registers. With 
reasonably sophisticated register optimization techniques, there is only marginal 
performance improvement with more than 32 registers. Finally, they noted that with 
a small number of registers (e.g., 16), a machine with a shared register organization 
executes faster than one with a split organization. Similar conclusions can be drawn 
from [HUGU91], which reports on a study that is primarily concerned with optimiz- 
ing the use of a small number of registers rather than comparing the use of large 
register sets with optimization efforts. 


13.4 REDUCED INSTRUCTION SET ARCHITECTURE 


In this section, we look at some of the general characteristics of and the motivation 
for a reduced instruction set architecture. Specific examples will be seen later in this 
chapter. We begin with a discussion of motivations for contemporary complex in- 
struction set architectures. 


Why CISC 


We have noted the trend to richer instruction sets, which include a larger number of 
instructions and more complex instructions. Two principal reasons have motivated 
this trend: a desire to simplify compilers and a desire to improve performance. 
Underlying both of these reasons was the shift to HLLs on the part of programmers; 
architects attempted to design machines that provided better support for HLLs. 

It is not the intent of this chapter to say that the CISC designers took the 
wrong direction. Indeed, because technology continues to evolve and because archi- 
tectures exist along a spectrum rather than in two neat categories, a black-and-white 
assessment is unlikely ever to emerge. Thus, the comments that follow are simply 
meant to point out some of the potential pitfalls in the CISC approach and to pro- 
vide some understanding of the motivation of the RISC adherents. 

The first of the reasons cited, compiler simplification, seems obvious. The task of 
the compiler writer is to generate a sequence of machine instructions for each HLL 
statement. If there are machine instructions that resemble HLL statements, this task is 
simplified. This reasoning has been disputed by the RISC researchers ([HENN82], 
[RADI83], [PATT82b]). They have found that complex machine instructions are often 
hard to exploit because the compiler must find those cases that exactly fit the con- 
struct. The task of optimizing the generated code to minimize code size, reduce in- 
struction execution count, and enhance pipelining is much more difficult with a 
complex instruction set. As evidence of this, studies cited earlier in this chapter indi- 
cate that most of the instructions in a compiled program are the relatively simple ones. 

The other major reason cited is the expectation that a CISC will yield smaller, 
faster programs. Let us examine both aspects of this assertion: that programs will be 
smaller and that they will execute faster. 

There are two advantages to smaller programs. First, because the program 
takes up less memory, there is a savings in that resource. With memory today being 
so inexpensive, this potential advantage is no longer compelling. More important, 
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Table 13.6 Code Size Relative to RISC I 


[PATT82a] [KATES83] [HEATS84] 
11 C Programs 12 C Programs 5 C Programs 


RISC I 1.0 1.0 1.0 
VAX-11/780 0.8 0.67 

M68000 0.9 0.9 
Z8002 1.2 

PDP-11/70 0.9 


smaller programs should improve performance, and this will happen in two ways. 
First, fewer instructions means fewer instruction bytes to be fetched. Second, in a 
paging environment, smaller programs occupy fewer pages, reducing page faults. 

The problem with this line of reasoning is that it is far from certain that a CISC 
program will be smaller than a corresponding RISC program. In many cases, the CISC 
program, expressed in symbolic machine language, may be shorter (i.e., fewer instruc- 
tions), but the number of bits of memory occupied may not be noticeably smaller. Table 
13.6 shows results from three studies that compared the size of compiled C programs 
on a variety of machines, including RISC I, which has a reduced instruction set archi- 
tecture. Note that there is little or no savings using a CISC over a RISC. It is also inter- 
esting to note that the VAX, which has a much more complex instruction set than the 
PDP-11, achieves very little savings over the latter. These results were confirmed by 
IBM researchers [RADI83], who found that the IBM 801 (a RISC) produced code that 
was 0.9 times the size of code on an IBM S/370. The study used a set of PL/I programs. 

There are several reasons for these rather surprising results. We have already 
noted that compilers on CISCs tend to favor simpler instructions, so that the concise- 
ness of the complex instructions seldom comes into play. Also, because there are 
more instructions on a CISC, longer opcodes are required, producing longer instruc- 
tions. Finally, RISCs tend to emphasize register rather than memory references, and 
the former require fewer bits. An example of this last effect is discussed presently. 

So the expectation that a CISC will produce smaller programs, with the atten- 
dant advantages, may not be realized. The second motivating factor for increasingly 
complex instruction sets was that instruction execution would be faster. It seems to 
make sense that a complex HLL operation will execute more quickly as a single 
machine instruction rather than as a series of more primitive instructions. However, 
because of the bias toward the use of those simpler instructions, this may not be so. 
The entire control unit must be made more complex, and/or the microprogram 
control store must be made larger, to accommodate a richer instruction set. Either 
factor increases the execution time of the simple instructions. 

In fact, some researchers have found that the speedup in the execution of com- 
plex functions is due not so much to the power of the complex machine instructions 
as to their residence in high-speed control store [RADI83]. In effect, the control 
store acts as an instruction cache. Thus, the hardware architect is in the position of 
trying to determine which subroutines or functions will be used most frequently and 
assigning those to the control store by implementing them in microcode. The results 
have been less than encouraging. On S/390 systems, instructions such as Translate 
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and Extended-Precision-Floating-Point-Divide reside in high-speed storage, while 
the sequence involved in setting up procedure calls or initiating an interrupt handler 
are in slower main memory. 

Thus, it is far from clear that a trend to increasingly complex instruction sets is 
appropriate. This has led a number of groups to pursue the opposite path. 


Characteristics of Reduced Instruction Set Architectures 


Although a variety of different approaches to reduced instruction set architecture 
have been taken, certain characteristics are common to all of them: 


e One instruction per cycle 
e Register-to-register operations 
e Simple addressing modes 


e Simple instruction formats 


Here, we provide a brief discussion of these characteristics. Specific examples are 
explored later in this chapter. 

The first characteristic listed is that there is one machine instruction per ma- 
chine cycle. A machine cycle is defined to be the time it takes to fetch two operands 
from registers, perform an ALU operation, and store the result in a register. Thus, 
RISC machine instructions should be no more complicated than, and execute about 
as fast as, microinstructions on CISC machines (discussed in Part Four). With 
simple, one-cycle instructions, there is little or no need for microcode; the machine 
instructions can be hardwired. Such instructions should execute faster than compa- 
rable machine instructions on other machines, because it is not necessary to access a 
microprogram control store during instruction execution. 

A second characteristic is that most operations should be register to register, 
with only simple LOAD and STORE operations accessing memory. This design fea- 
ture simplifies the instruction set and therefore the control unit. For example, a 
RISC instruction set may include only one or two ADD instructions (e.g., integer 
add, add with carry); the VAX has 25 different ADD instructions. Another benefit is 
that such an architecture encourages the optimization of register use, so that fre- 
quently accessed operands remain in high-speed storage. 

This emphasis on register-to-register operations is notable for RISC designs. 
Contemporary CISC machines provide such instructions but also include memory- 
to-memory and mixed register/memory operations. Attempts to compare these ap- 
proaches were made in the 1970s, before the appearance of RISCs. Figure 13.5a 
illustrates the approach taken. Hypothetical architectures were evaluated on pro- 
gram size and the number of bits of memory traffic. Results such as this one led one 
researcher to suggest that future architectures should contain no registers at all 
[MYER78]. One wonders what he would have thought, at the time, of the RISC ma- 
chine once produced by Pyramid, which contained no less than 528 registers! 

What was missing from those studies was a recognition of the frequent access 
to a small number of local scalars and that, with a large bank of registers or an opti- 
mizing compiler, most operands could be kept in registers for long periods of time. 
Thus, Figure 13.5b may be a fairer comparison. 

A third characteristic is the use of simple addressing modes. Almost all RISC 
instructions use simple register addressing. Several additional modes, such as dis- 
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8 16 16 16 8 4 16 
Add B Ee A Load |RB 
i Load |RC 
I= 56D 96, M = 152 Add : RB| RC 
Store | R A 
A 


Register to memory 
I= 104, D = 96, M = 200 


(ay ACB+C 
8 16 16 16 8 4 4 4 
Add B C A Add | RA|RB| RC 
Add Cc B Add |RB| RA| RC 
Sub B D D Sub |RD]| RD] RB 


Memory to memory Register to memory 
I = 168, D = 288, M = 456 I= 60, D = 0,M = 60 


(bs) ACB +C;BCA+C;D¢D-B 


I = number of bytes occupied by executed instructions 
D = number of bytes occupied by data 
M = total memory traffic = I + D 


Figure 13.5 Two Comparisons of Register-to-Register and Memory-to-Memory Approaches 


placement and PC-relative, may be included. Other, more complex modes can be 
synthesized in software from the simple ones. Again, this design feature simplifies 
the instruction set and the control unit. 

A final common characteristic is the use of simple instruction formats. Gener- 
ally, only one or a few formats are used. Instruction length is fixed and aligned on 
word boundaries. Field locations, especially the opcode, are fixed. This design 
feature has a number of benefits. With fixed fields, opcode decoding and register 
operand accessing can occur simultaneously. Simplified formats simplify the control 
unit. Instruction fetching is optimized because word-length units are fetched. Align- 
ment on a word boundary also means that a single instruction does not cross page 
boundaries. 

Taken together, these characteristics can be assessed to determine the poten- 
tial performance benefits of the RISC approach. A certain amount of “circumstan- 
tial evidence” can be presented. First, more effective optimizing compilers can be 
developed. With more-primitive instructions, there are more opportunities for mov- 
ing functions out of loops, reorganizing code for efficiency, maximizing register uti- 
lization, and so forth. It is even possible to compute parts of complex instructions at 
compile time. For example, the S/390 Move Characters (MVC) instruction moves a 
string of characters from one location to another. Each time it is executed, the move 
will depend on the length of the string, whether and in which direction the locations 
overlap, and what the alignment characteristics are. In most cases, these will all be 
known at compile time. Thus, the compiler could produce an optimized sequence of 
primitive instructions for this function. 

A second point, already noted, is that most instructions generated by a compiler 
are relatively simple anyway. It would seem reasonable that a control unit built 
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specifically for those instructions and using little or no microcode could execute 
them faster than a comparable CISC. 

A third point relates to the use of instruction pipelining. RISC researchers feel 
that the instruction pipelining technique can be applied much more effectively with 
a reduced instruction set. We examine this point in some detail presently. 

A final, and somewhat less significant, point is that RISC processors are more 
responsive to interrupts because interrupts are checked between rather elementary 
operations. Architectures with complex instructions either restrict interrupts to in- 
struction boundaries or must define specific interruptible points and implement 
mechanisms for restarting an instruction. 

The case for improved performance for a reduced instruction set architecture is 
strong, but one could perhaps still make an argument for CISC. A number of studies 
have been done but not on machines of comparable technology and power. Further, 
most studies have not attempted to separate the effects of a reduced instruction set and 
the effects of a large register file. The “circumstantial evidence,” however, is suggestive. 


CISC versus RISC Characteristics 


After the initial enthusiasm for RISC machines, there has been a growing realiza- 
tion that (1) RISC designs may benefit from the inclusion of some CISC features 
and that (2) CISC designs may benefit from the inclusion of some RISC features. 
The result is that the more recent RISC designs, notably the PowerPC, are no longer 
“pure” RISC and the more recent CISC designs, notably the Pentium II and later 
Pentium models, do incorporate some RISC characteristics. 

An interesting comparison in [MASH95] provides some insight into this issue. 
Table 13.7 lists a number of processors and compares them across a number of char- 
acteristics. For purposes of this comparison, the following are considered typical of a 
classic RISC: 


1. A single instruction size. 


That size is typically 4 bytes. 


YN 


A small number of data addressing modes, typically less than five. This parameter 
is difficult to pin down. In the table, register and literal modes are not counted 
and different formats with different offset sizes are counted separately. 


4. No indirect addressing that requires you to make one memory access to get the 
address of another operand in memory. 


5. No operations that combine load/store with arithmetic (e.g., add from memory, 
add to memory). 


6. No more than one memory-addressed operand per instruction. 
7. Does not support arbitrary alignment of data for load/store operations. 


8. Maximum number of uses of the memory management unit (MMU) for a data 
address in an instruction. 


9. Number of bits for integer register specifier equal to five or more. This means 
that at least 32 integer registers can be explicitly referenced at a time. 


10. Number of bits for floating-point register specifier equal to four or more. This means 
that at least 16 floating-point registers can be explicitly referenced at a time. 


66r 


Processor 


Number 
of 
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Max 
instruction 
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fable 13.7 Characteristics of Some Processors 
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AMD29000 


A 


no 


no 


no 


MIPS R2000 


no 


no 


no 


SPARC 


no 


no 


no 


MC88000 


no 


no 


HP PA 


no 


no 


IBM RT/PC 


no 


IBM RS/6000 


no 


Intel 1860 


no 


IBM 3090 
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Items 1 through 3 are an indication of instruction decode complexity. Items 4 
through 8 suggest the ease or difficulty of pipelining, especially in the presence of 
virtual memory requirements. Items 9 and 10 are related to the ability to take good 
advantage of compilers. 

In the table, the first eight processors are clearly RISC architectures, the next 
five are clearly CISC, and the last two are processors often thought of as RISC that 
in fact have many CISC characteristics. 


13.5 RISC PIPELINING 


Pipelining with Regular Instructions 


As we discussed in Section 12.4, instruction pipelining is often used to enhance per- 
formance. Let us reconsider this in the context of a RISC architecture. Most instruc- 
tions are register to register, and an instruction cycle has the following two stages: 


e I: Instruction fetch. 
e E: Execute. Performs an ALU operation with register input and output. 


For load and store operations, three stages are required: 


e J: Instruction fetch. 
e E: Execute. Calculates memory address 
e D: Memory. Register-to-memory or memory-to-register operation. 


Figure 13.6a depicts the timing of a sequence of instructions using no pipelin- 
ing. Clearly, this is a wasteful process. Even very simple pipelining can substantially 
improve performance. Figure 13.6b shows a two-stage pipelining scheme, in which 
the I and E stages of two different instructions are performed simultaneously. The 
two stages of the pipeline are an instruction fetch stage, and an execute/memory 
stage that executes the instruction, including register-to-memory and memory- 
to-register operations. Thus we see that the instruction fetch stage of the second 
instruction can e performed in parallel with the first part of the execute/memory 
stage. However, the execute/memory stage of the second instruction must be delayed 
until the first instruction clears the second stage of the pipeline. This scheme can yield 
up to twice the execution rate of a serial scheme. Two problems prevent the maxi- 
mum speedup from being achieved. First, we assume that a single-port memory is 
used and that only one memory access is possible per stage. This requires the inser- 
tion of a wait state in some instructions. Second, a branch instruction interrupts the 
sequential flow of execution. To accommodate this with minimum circuitry, a NOOP 
instruction can be inserted into the instruction stream by the compiler or assembler. 

Pipelining can be improved further by permitting two memory accesses per 
stage. This yields the sequence shown in Figure 13.6c. Now, up to three instructions 
can be overlapped, and the improvement is as much as a factor of 3. Again, branch 
instructions cause the speedup to fall short of the maximum possible. Also, note that 
data dependencies have an effect. If an instruction needs an operand that is altered 
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Store M<rC Store M¢€rC T/IEID 
Branch X Branch X T E 
NOOP TIE 
(a) Sequential execution (b) Two-stage pipelined timing 
Load rAeM IJE|D Load rAeM I |E,|E,|D 
Load 1rB¢M IJE|D Load 1rB¢M I /E,|E,|D 
NOOP I|E NOOP I |E,|E, 
Add rc¢rA + 1B I|E NOOP I |E,|E, 
Store Me rC I/E|D Add = rC<—rA + 1B I |E,|E, 
Branch X I|E Store Me«rC I {E,|E,|D 
NOOP I|E Branch X I |E,|E> 
NOOP 1 |E,|E, 
NOOP T JE, |E> 
(c) Three-stage pipelined timing (d) Four-stage pipelined timing 


Figure 13.6 The Effects of Pipelining 


by the preceding instruction, a delay is required. Again, this can be accomplished by 
a NOOP. 

The pipelining discussed so far works best if the three stages are of approxi- 
mately equal duration. Because the E stage usually involves an ALU operation, it 
may be longer. In this case, we can divide into two substages: 


e E,: Register file read 


e E,: ALU operation and register write 


Because of the simplicity and regularity of a RISC instruction set, the design 
of the phasing into three or four stages is easily accomplished. Figure 13.6d shows 
the result with a four-stage pipeline. Up to four instructions at a time can be under 
way, and the maximum potential speedup is a factor of 4. Note again the use of 
NOOPs to account for data and branch delays. 


Optimization of Pipelining 


Because of the simple and regular nature of RISC instructions, pipelining schemes 
can be efficiently employed. There are few variations in instruction execution dura- 
tion, and the pipeline can be tailored to reflect this. However, we have seen that data 
and branch dependencies reduce the overall execution rate. 


DELAYED BRANCH To compensate for these dependencies, code reorganization tech- 
niques have been developed. First, let us consider branching instructions. Delayed 
branch, a way of increasing the efficiency of the pipeline, makes use of a branch that 
does not take effect until after execution of the following instruction (hence the term 
delayed). The instruction location immediately following the branch is referred to as 
the delay slot. This strange procedure is illustrated in Table 13.8. In the column labeled 
“normal branch,” we see a normal symbolic instruction machine-language program. 
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Table 13.8 Normal and Delayed Branch 


Optimized 
Address Normal Branch Delayed Branch Delayed Branch 


100 LOAD X,rA LOAD X,rA LOAD X,rA 
101 ADD 1,rA ADD 1,rA JUMP 105 
102 JUMP 105 JUMP 106 ADD 1,rA 
103 ADD rA,1rB NOOP ADD rA,rB 
104 SUB rC,rB ADD rA,rB SUB rC,rB 
105 STORE 1A,Z SUB 1rC,rB STORE 1A,Z 
106 STORE rA,Z 


After 102 is executed, the next instruction to be executed is 105. To regularize the 
pipeline, a NOOP is inserted after this branch. However, increased performance is 
achieved if the instructions at 101 and 102 are interchanged. 

Figure 13.7 shows the result. Figure 13.7a shows the traditional approach to 
pipelining, of the type discussed in Chapter 12 (e.g., see Figures 12.11 and 12.12). 


Time 


1 2 3 4 5 6 7 
100 LOAD X, rA I E D 
101 ADD 1, rA I E 
102 JUMP 105 I E 
103 ADD rA, rB I E 
105 STORE rA, Z I E D 
(a) Traditional pipeline 
100 LOAD X, rA I E D 
101 ADD 1, rA I E 
102 JUMP 106 I E 
103 NOOP I E 
106 STORE rA, Z I E D 
(b) RISC pipeline with inserted NOOP 
100 LOAD X, Ar I E D 
101 JUMP 105 I E 
102 ADD 1, rA I E 
105 STORE rA, Z I E D 


(c) Reversed instructions 


Figure 13.7 Use of the Delayed Branch 


13.5 / RISC PIPELINING 503 


The JUMP instruction is fetched at time 3. At time 4, the JUMP instruction is 
executed at the same time that instruction 103 (ADD instruction) is fetched. 
Because a JUMP occurs, which updates the program counter, the pipeline must be 
cleared of instruction 103; at time 5, instruction 105, which is the target of the JUMP, 
is loaded. Figure 13.7b shows the same pipeline handled by a typical RISC organiza- 
tion. The timing is the same. However, because of the insertion of the NOOP in- 
struction, we do not need special circuitry to clear the pipeline; the NOOP simply 
executes with no effect. Figure 13.7c shows the use of the delayed branch. The 
JUMP instruction is fetched at time 2, before the ADD instruction, which is fetched 
at time 3. Note, however, that the ADD instruction is fetched before the execution 
of the JUMP instruction has a chance to alter the program counter. Therefore, dur- 
ing time 4, the ADD instruction is executed at the same time that instruction 105 is 
fetched. Thus, the original semantics of the program are retained but one less clock 
cycle is required for execution. 

This interchange of instructions will work successfully for unconditional 
branches, calls, and returns. For conditional branches, this procedure cannot be 
blindly applied. If the condition that is tested for the branch can be altered by the 
immediately preceding instruction, then the compiler must refrain from doing the 
interchange and instead insert a NOOP. Otherwise, the compiler can seek to insert a 
useful instruction after the branch. The experience with both the Berkeley RISC 
and IBM 801 systems is that the majority of conditional branch instructions can be 
optimized in this fashion ([PATT82a], [RADI83]). 


DELAYED LOAD A similar sort of tactic, called the delayed load, can be used on 
LOAD instructions. On LOAD instructions, the register that is to be the target of 
the load is locked by the processor. The processor then continues execution of 
the instruction stream until it reaches an instruction requiring that register, at 
which point it idles until the load is complete. If the compiler can rearrange 
instructions so that useful work can be done while the load is in the pipeline, effi- 
ciency is increased. 


Loop Unrolling Simulator 


LOOP UNROLLING Another compiler technique to improve instruction parallelism 

is loop unrolling [BACO94]. Unrolling replicates the body of a loop some number of 

times called the unrolling factor (w) and iterates by step u instead of step 1. 
Unrolling can improve the performance by 


reducing loop overhead 


increasing instruction parallelism by improving pipeline performance 


improving register, data cache, or TLB locality 


Figure 13.8 illustrates all three of these improvements in an example. Loop 
overhead is cut in half because two iterations are performed before the test and 
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(a) Original loop 


Glo) =2, m2, 72 
éi(lak]) = gill) + alia] * 
éi[asei || = eis] + alia] 
end do 


if (mod(n-2, 2) = i) then 
a[n-1] = a[n-1] + a[n-2] * 
end if 


(b) Loop unrolled twice 


Figure 13.8 Loop unrolling 


branch at the end of the loop. Instruction parallelism is increased because the sec- 
ond assignment can be performed while the results of the first are being stored and 
the loop variables are being updated. If array elements are assigned to registers, reg- 
ister locality will improve because a[i] and a[i + 1] are used twice in the loop body, 
reducing the number of loads per iteration from three to two. 

As a final note, we should point out that the design of the instruction pipeline 
should not be carried out in isolation from other optimization techniques applied to 
the system. For example, [BRAD91b] shows that the scheduling of instructions for 
the pipeline and the dynamic allocation of registers should be considered together 
to achieve the greatest efficiency. 


13.6 MIPS R4000 


One of the first commercially available RISC chip sets was developed by MIPS 
Technology Inc. The system was inspired by an experimental system, also using the 
name MIPS, developed at Stanford [HENN84]. In this section we look at the MIPS 
R4000. It has substantially the same architecture and instruction set of the earlier 
MIPS designs: the R2000 and R3000. The most significant difference is that the 
R4000 uses 64 rather than 32 bits for all internal and external data paths and for ad- 
dresses, registers, and the ALU. 

The use of 64 bits has a number of advantages over a 32-bit architecture. It 
allows a bigger address space—large enough for an operating system to map more 
than a terabyte of files directly into virtual memory for easy access. With 1-terabyte 
and larger disk drives now common, the 4-gigabyte address space of a 32-bit ma- 
chine becomes limiting. Also, the 64-bit capacity allows the R4000 to process data 
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such as IEEE double-precision floating-point numbers and character strings, up to 
eight characters in a single action. 

The R4000 processor chip is partitioned into two sections, one containing the 
CPU and the other containing a coprocessor for memory management. The proces- 
sor has a very simple architecture. The intent was to design a system in which the in- 
struction execution logic was as simple as possible, leaving space available for logic 
to enhance performance (e.g., the entire memory-management unit). 

The processor supports thirty-two 64-bit registers. It also provides for up to 
128 Kbytes of high-speed cache, half each for instructions and data. The relatively 
large cache (the IBM 3090 provides 128 to 256 Kbytes of cache) enables the system 
to keep large sets of program code and data local to the processor, off-loading the 
main memory bus and avoiding the need for a large register file with the accompa- 
nying windowing logic. 


Instruction Set 


Table 13.9 lists the basic instruction set for all MIPS R series processors. All proces- 
sor instructions are encoded in a single 32-bit word format. All data operations are 
register to register; the only memory references are pure load/store operations. 

The R4000 makes no use of condition codes. If an instruction generates a 
condition, the corresponding flags are stored in a general-purpose register. This 
avoids the need for special logic to deal with condition codes as they affect the 
pipelining mechanism and the reordering of instructions by the compiler. Instead, 
the mechanisms already implemented to deal with register-value dependencies 
are employed. Further, conditions mapped onto the register files are subject to the 
same compile-time optimizations in allocation and reuse as other values stored 
in registers. 

As with most RISC-based machines, the MIPS uses a single 32-bit instruction 
length. This single instruction length simplifies instruction fetch and decode, and it 
also simplifies the interaction of instruction fetch with the virtual memory manage- 
ment unit (i.e., instructions do not cross word or page boundaries). The three in- 
struction formats (Figure 13.9) share common formatting of opcodes and register 
references, simplifying instruction decode. The effect of more complex instructions 
can be synthesized at compile time. 

Only the simplest and most frequently used memory-addressing mode is im- 
plemented in hardware. All memory references consist of a 16-bit offset from a 
32-bit register. For example, the “load word” instruction is of the form 


lw r2, 128(r3) /* load word at address 128 offset from 
register 3 into register 2 


Each of the 32 general-purpose registers can be used as the base register. One register, 
r0, always contains 0. 

The compiler makes use of multiple machine instructions to synthesize 
typical addressing modes in conventional machines. Here is an example from 
[CHOW87], which uses the instruction lui (load upper immediate). This instruction 
loads the upper half of a register with a 16-bit immediate value, setting the lower 
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Description 


) INSTRUCTION SET CO 


Description 


Load/Store Instructions 


Load Byte 

Load Byte Unsigned 
Load Halfword 

Load Halfword Unsigned 
Load Word 

Load Word Left 

Load Word Right 

Store Byte 

Store Halfword 

Store Word 

Store Word Left 

Store Word Right 
Arithmetic Instructions 
(ALU Immediate) 

Add Immediate 

Add Immediate Unsigned 
Set on Less Than Immediate 


Set on Less Than Immediate 
Unsigned 


AND Immediate 

OR Immediate 
Exclusive-OR Immediate 
Load Upper Immediate 
Arithmetic Instructions 
(3-operand, R-type) 

Add 

Add Unsigned 

Subtract 

Subtract Unsigned 

Set on Less Than 

Set on Less Than Unsigned 
AND 

OR 

Exclusive-OR 

NOR 


Shift Instructions 


Shift Left Logical 
Shift Right Logical 
Shift Right Arithmetic 


BLEZ 
BGTZ 
BLITZ 
BGEZ 
BLTZAL 
BGEZAL 


SYSCALL 
BREAK 


Shift Left Logical Variable 
Shift Right Logical Variable 
Shift Right Arithmetic Variable 


Multiply/Divide Instructions 


Multiply 

Multiply Unsigned 
Divide 

Divide Unsigned 
Move From HI 
Move To HI 

Move From LO 
Move To LO 


Jump and Branch Instructions 


Jump 

Jump and Link 

Jump to Register 

Jump and Link Register 
Branch on Equal 
Branch on Not Equal 


Branch on Less Than or Equal to Zero 


Branch on Greater Than Zero 


Branch on Less Than Zero 


Branch on Greater Than or Equal to Zero 
Branch on Less Than Zero And Link 


Branch on Greater Than or Equal to Zero 


And Link 
Coprocessor Instructions 


Load Word to Coprocessor 

Store Word to Coprocessor 
Move To Coprocessor 

Move From Coprocessor 

Move Control To Coprocessor 
Move Control From Coprocessor 
Coprocessor Operation 

Branch on Coprocessor z True 


Branch on Coprocessor z False 
Special Instructions 


System Call 
Break 
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6 5 5 16 
F T-type Operation rs rt Immediate 
(immediate) 
6 26 
J-type Operation Target 
(jump) 
6 5 5 5 5 6 
R-type Operation rs rt rd Shift Function 
(register) 


Operation Operation code 


rs Source register specifier 

rt Source/destination register specifier 
Immediate Immediate, branch, or address displacement 
Target Jump target address 

rd Destination register specifier 

Shift Shift amount 


Function ALU/shift function specifier 


Figure 13.9 MIPS Instruction Formats 
half to zero. Consider an assembly-language instruction that uses a 32-bit immedi- 
ate argument 


lw r2, #imm(r4) /* Jjoad word at address using a 32-bit 
immediate offset #imm 


/* offset from register 4 into register 2 


This instruction can be compiled into the following MIPS instructions 


lui rl, #imm-hi /* where #imm-hi is the high-order 
16 bits of #imm 

addu r1, rl, r4 /* add unsigned #imm-hi to r4 and 
put in rl 


lw r2, #imm-lo(r1) /* where #imm-lo is the low-order 
16 bits of #imm 


Instruction Pipeline 


With its simplified instruction architecture, the MIPS can achieve very efficient 
pipelining. It is instructive to look at the evolution of the MIPS pipeline, as it illus- 
trates the evolution of RISC pipelining in general. 

The initial experimental RISC systems and the first generation of commercial 
RISC processors achieve execution speeds that approach one instruction per system 
clock cycle. To improve on this performance, two classes of processors have evolved 
to offer execution of multiple instructions per clock cycle: superscalar and super- 
pipelined architectures. In essence, a superscalar architecture replicates each of the 
pipeline stages so that two or more instructions at the same stage of the pipeline can 
be processed simultaneously. A superpipelined architecture is one that makes use of 
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more, and more fine-grained, pipeline stages. With more stages, more instructions 
can be in the pipeline at the same time, increasing parallelism. 

Both approaches have limitations. With superscalar pipelining, dependencies 
between instructions in different pipelines can slow down the system. Also, over- 
head logic is required to coordinate these dependencies. With superpipelining, there 
is overhead associated with transferring instructions from one stage to the next. 

Chapter 14 is devoted to a study of superscalar architecture. The MIPS R4000 
is a good example of a RISC-based superpipeline architecture. 


MIPS R3000 Five-Stage Pipeline Simulator 


Figure 13.10a shows the instruction pipeline of the R3000. In the R3000, the 
pipeline advances once per clock cycle. The MIPS compiler is able to reorder in- 
structions to fill delay slots with code 70 to 90% of the time. All instructions follow 
the same sequence of five pipeline stages: 


e Instruction fetch 


e Source operand fetch from register file 


Clock Cycle 
bo | b& b | b& bo | b b | b& b | b& 
IF RD ALU MEM WB 
I-Cache RF ALU OP D-Cache WB 
ITLB IDEC DA DTLB 
IA 
(a) Detailed R3000 pipeline 
IF = Instruction fetch 
RD = Read 
| Cycle | Cycle | Cycle | Cycle | Cycle | Cycle | MEM = Memonpaadedl 
ITLB | I-Cache|RF| ALU | DTLB |D-Cache/WB] = 8) = Write back 
I-Cache = Instruction cache access 
RF = Fetch operand from register 
(b) Modified R3000 pipeline with reduced latencies D-Cache = Data cache access 
ITLB = Instruction address translation 
IDEC = Instruction decode 
Cycle Cycle | Cycle | Cycle | Cycle | IA = Compute instruction address 
DA = Calculate data virtual address 
ITLB |RF| ALU |D-Cache} TC |WB DTLB = Data address translation 
TC = Data cache tag check 


(c) Optimized R3000 pipeline with parallel TLB and cache accesses 
Figure 13.10 Enhancing the R3000 Pipeline 


Table 13. 
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e ALU operation or data operand address generation 
e Data memory reference 
e Write back into register file 


As illustrated in Figure 13.10a, there is not only parallelism due to pipelining but 
also parallelism within the execution of a single instruction. The 60-ns clock cycle is di- 
vided into two 30-ns stages. The external instruction and data access operations to the 
cache each require 60 ns, as do the major internal operations (OP, DA, IA). Instruc- 
tion decode is a simpler operation, requiring only a single 30-ns stage, overlapped with 
register fetch in the same instruction. Calculation of an address for a branch instruc- 
tion also overlaps instruction decode and register fetch, so that a branch at instruction 
ican address the ICACHE access of instruction i + 2. Similarly, a load at instruction i 
fetches data that are immediately used by the OP of instruction i + 1, while an 
ALU/shift result gets passed directly into instruction i + 1 with no delay. This tight 
coupling between instructions makes for a highly efficient pipeline. 

In detail, then, each clock cycle is divided into separate stages, denoted as #1 
and $2. The functions performed in each stage are summarized in Table 13.10. 

The R4000 incorporates a number of technical advances over the R3000. The 
use of more advanced technology allows the clock cycle time to be cut in half, to 30 ns, 
and for the access time to the register file to be cut in half. In addition, there is greater 
density on the chip, which enables the instruction and data caches to be incorporated 
on the chip. Before looking at the final R4000 pipeline, let us consider how the R3000 
pipeline can be modified to improve performance using R4000 technology. 

Figure 13.10b shows a first step. Remember that the cycles in this figure are half 
as long as those in Figure 13.10a. Because they are on the same chip, the instruction 


10 R3000 Pipeline Stages 


Pipeline 
Stage 


Phase Function 


Using the TLB, translate an instruction virtual address to a physical 
address (after a branching decision). 


Send the physical address to the instruction address. 

Return instruction from instruction cache. 

Compare tags and validity of fetched instruction. 

Decode instruction. 

Read register file. 

If branch, calculate branch target address. 

If register-to-register operation, the arithmetic or logical operation is performed. 
Tf a branch, decide whether the branch is to be taken or not. 

If a memory reference (load or store), calculate data virtual address. 

If a memory reference, translate data virtual address to physical using TLB. 
If a memory reference, send physical address to data cache. 

If a memory reference, return data from data cache, and check tags. 


Write to register file. 
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and data cache stages take only half as long; so they still occupy only one clock cycle. 
Again, because of the speedup of the register file access, register read and write still 
occupy only half of a clock cycle. 

Because the R4000 caches are on-chip, the virtual-to-physical address transla- 
tion can delay the cache access. This delay is reduced by implementing virtually 
indexed caches and going to a parallel cache access and address translation. Fig- 
ure 13.10c shows the optimized R3000 pipeline with this improvement. Because of the 
compression of events, the data cache tag check is performed separately on the next 
cycle after cache access. This check determines whether the data item is in the cache. 

In a superpipelined system, existing hardware is used several times per cycle 
by inserting pipeline registers to split up each pipe stage. Essentially, each super- 
pipeline stage operates at a multiple of the base clock frequency, the multiple de- 
pending on the degree of superpipelining. The R4000 technology has the speed and 
density to permit superpipelining of degree 2. Figure 13.11a shows the optimized 
R3000 pipeline using this superpipelining. Note that this is essentially the same dy- 
namic structure as Figure 13.10c. 

Further improvements can be made. For the R4000, a much larger and special- 
ized adder was designed. This makes it possible to execute ALU operations at twice 
the rate. Other improvements allow the execution of loads and stores at twice the 
rate. The resulting pipeline is shown in Figure 13.11b. 

The R4000 has eight pipeline stages, meaning that as many as eight instruc- 
tions can be in the pipeline at the same time. The pipeline advances at the rate of 
two stages per clock cycle. The eight pipeline stages are as follows: 


e Instruction fetch first half: Virtual address is presented to the instruction 
cache and the translation lookaside buffer. 


Clock Cycle 

7 
Icl 1C2 RF ALU | ALU DC1 DC2 TCl TC2 WB 
Icl 1C2 RF ALU | ALU DC1 DC2 TCl TC2 WB 


(a) Superpipelined implmentation of the optimized R3000 pipeline 


Clock Cycle 
>, AC 4), 2} 7) 

IF IS RF EX DF DS TC WB 

IF IS RF EX DF DS TC WB 
(b) R4000 pipeline 

IF = Instruction fetch first half DC = Data cache 
IS = Instruction fetch second half DF = Data cache first half 
RF = Fetch operands from register DS = Data cache second half 
EX = Instruction execute TC = Tag check 
IC = Instruction cache 


Figure 13.11 Theoretical R3000 and Actual R4000 Superpipelines 


13.7/ SPARC 511 


Instruction fetch second half: Instruction cache outputs the instruction and 
the TLB generates the physical address. 


Register file: Three activities occur in parallel: 


e Instruction is decoded and check made for interlock conditions (i.e., this 
instruction depends on the result of a preceding instruction). 


e Instruction cache tag check is made. 
e Operands are fetched from the register file. 
e Instruction execute: One of three activities can occur: 


e If the instruction is a register-to-register operation, the ALU performs the 
arithmetic or logical operation. 


e Ifthe instruction is a load or store, the data virtual address is calculated. 


e Ifthe instruction is a branch, the branch target virtual address is calculated 
and branch conditions are checked. 


Data cache first: Virtual address is presented to the data cache and TLB. 


e Data cache second: The TLB generates the physical address, and the data 
cache outputs the instruction. 


Tag check: Cache tag checks are performed for loads and stores. 
e Write back: Instruction result is written back to register file. 


13.7 SPARC 


SPARC (Scalable Processor Architecture) refers to an architecture defined by Sun 
Microsystems. Sun developed its own SPARC implementation but also licenses the 
architecture to other vendors to produce SPARC-compatible machines. The 
SPARC architecture is inspired by the Berkeley RISC I machine, and its instruction 
set and register organization is based closely on the Berkeley RISC model. 


SPARC Register Set 


As with the Berkeley RISC, the SPARC makes use of register windows. Each win- 
dow consists of 24 registers, and the total number of windows is implementation de- 
pendent and ranges from 2 to 32 windows. Figure 13.12 illustrates an implementation 
that supports 8 windows, using a total of 136 physical registers; as the discussion in 
Section 13.2 indicates, this seems a reasonable number of windows. Physical registers 
0 through 7 are global registers shared by all procedures. Each process sees logical 
registers 0 through 31. Logical registers 24 through 31, referred to as ins, are shared 
with the calling (parent) procedure; and logical registers 8 through 15, referred to as 
outs, are shared with any called (child) procedure. These two portions overlap with 
other windows. Logical registers 16 through 23, referred to as /ocals, are not shared 
and do not overlap with other windows. Again, as the discussion of Section 12.1 indi- 
cates, the availability of 8 registers for parameter passing should be adequate in most 
cases (e.g., see Table 13.4). 
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ie aaa Logical registers 
registers 
Procedure A Procedure B Procedure C 
135 R31, 
: Ins : Ins 
128 R24, 
127 R23, 
: Locals : Locals 
120 R16, 
119 R15, R31, 
: Outs/Ins : Outs : Ins 
112 R8, R24, 
111 R23, 
: Locals : Locals 
104 R16, 
103 R15, R31¢ 
: Outs/Ins : Outs : Ins 
96 R83 R24¢ 
95 R23¢ 
: Locals : Locals 
88 R16¢ 
87 R15¢ 
: Outs : Outs 
80 R8. 
7 R7 R7 R7 
: Globals : Globals : Globals : Globals 
0 RO RO RO 


Figure 13.12 SPARC Register Window Layout with Three Procedures 


Figure 13.13 is another view of the register overlap. The calling procedure 
places any parameters to be passed in its outs registers; the called procedure treats 
these same physical registers as it ins registers. The processor maintains a current 
window pointer (CWP), located in the processor status register (PSR), that points to 
the window of the currently executing procedure. The window invalid mask (WIM), 
also in the PSR, indicates which windows are invalid. 

With the SPARC register architecture, it is usually not necessary to save and 
restore registers for a procedure call. The compiler is simplified because the com- 
piler need be concerned only with allocating the local registers for a procedure in 
an efficient manner and need not be concerned with register allocation between 
procedures. 
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w2 
outs locals 


w6 w2 
locals outs 


Figure 13.13 Eight Register Windows Forming a Circular Stack 
in SPARC 


Instruction Set 


Table 13.11 lists the instructions for the SPARC architecture. Most of the instruc- 
tions reference only register operands. Register-to-register instructions have three 
operands and can be expressed in the form 


Ry <— Rg; op S2 


where R, and Rg, are register references; S2 can refer either to a register or to a 13- 
bit immediate operand. Register zero (Ro) is hardwired with the value 0. This form 
is well suited to typical programs, which have a high proportion of local scalars and 
constants. 

The available ALU operations can be grouped as follows: 


e Integer addition (with or without carry) 

e Integer subtraction (with or without carry) 

e Bitwise Boolean AND, OR, XOR and their negations 
Shift left logical, right logical, or right arithmetic 


Table 13.11 SPARC Instruction Set 


Description Description 


Load/Store Instructions Arithmetic Instructions 
LDSB Load signed byte ADD Add 
LDSH Load signed halfword ADDCC Add, set icc 
LDUB Load unsigned byte ADDX Add with carry 
LDUH Load unsigned halfword ADDXCC Add with carry, set icc 
LD Load word SUB Subtract 
LDD Load doubleword SUBCC Subtract, set icc 
STB Store byte SUBX Subtract with carry 
STH Store halfword SUBXCC Subtract with carry, set icc 
STD Store word MULSCC Multiply step, set icc 
STDD Store doubleword Jump/Branch Instructions 


Shift Instructions BCC Branch on condition 


SLL Shift left logical FBCC Branch on floating-point condition 


SRL Shift right logical CBCC Branch on coprocessor condition 


SRA Shift right arithmetic CALL Call procedure 


Boolean Instructions JMPL Jump and link 
AND AND TCC Trap on condition 
ANDCC AND, set icc SAVE Advance register window 
ANDN NAND RESTORE Move windows backward 
ANDNCC NAND, set icc RETT Return from trap 
OR OR Miscellaneous Instructions 
ORCC OR, set icc SETHI Set high 22 bits 
ORN NOR UNIMP Unimplemented instruction (trap) 
ORNCC NOR, set icc RD Read a special register 
XOR XOR WR Write a special register 
XORCC XOR, set icc IFLUSH Instruction cache flush 
XNOR Exclusive NOR 
XNORCC Exclusive NOR, set icc 


All of these instructions, except the shifts, can optionally set the four condition 
codes (ZERO, NEGATIVE, OVERFLOW, CARRY). Signed integers are repre- 
sented in 32-bit twos complement form. 

Only simple load and store instructions reference memory. There are separate 
load and store instructions for word (32 bits), doubleword, halfword, and byte. For 
the latter two cases, there are instructions for loading these quantities as signed or 
unsigned numbers. Signed numbers are sign extended to fill out the 32-bit destina- 
tion register. Unsigned numbers are padded with zeros. 
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Table 13.12 Synthesizing Other Addressing Modes with SPARC Addressing Modes 


Instruction Type Addressing Mode Algorithm SPARC Equivalent 


Register-to-register Immediate operand = A $2 
Load, store Direct EA=A Ry + S2 
Register-to-register Register EA=R Rg1, Rsz 
Load, store Register Indirect EA = (R) Rs; + 0 
Load, store Displacement EA =(R)+A Rg; + S2 


S2 = either a register operand or a 13-bit immediate operand 


The only available addressing mode, other than register, is a displacement 
mode. That is, the effective address (EA) of an operand consists of a displacement 
from an address contained in a register: 


EA = (Rg) + S2 
or EA = (Rg;) + (Rgo) 


depending on whether the second operand is immediate or a register reference. To 
perform a load or store, an extra stage is added to the instruction cycle. During the 
second stage, the memory address is calculated using the ALU; the load or store oc- 
curs in a third stage. This single addressing mode is quite versatile and can be used to 
synthesize other addressing modes, as indicated in Table 13.12. 

It is instructive to compare the SPARC addressing capability with that of the 
MIPS. The MIPS makes use of a 16-bit offset, compared with a 13-bit offset on the 
SPARC. On the other hand, the MIPS does not permit an address to be constructed 
from the contents of two registers. 


Instruction Format 


As with the MIPS R4000, SPARC uses a simple set of 32-bit instruction formats 
(Figure 13.14). All instructions begin with a 2-bit opcode. For most instructions, this 
is extended with additional opcode bits elsewhere in the format. For the Call in- 
struction, a 30-bit immediate operand is extended with two zero bits to the right to 
form a 32-bit PC-relative address in twos complement form. Instructions are aligned 
on a 32-bit boundary so that this form of addressing suffices. 

The Branch instruction includes a 4-bit condition field that corresponds to the 
four standard condition code bits, so that any combination of conditions can be tested. 
The 22-bit PC-relative address is extended with two zero bits on the right to form a 
24-bit twos complement relative address. An unusual feature of the Branch instruc- 
tion is the annul bit. When the annul bit is not set, the instruction after the branch is 
always executed, regardless of whether the branch is taken. This is the typical de- 
layed branch operation found on many RISC machines and described in Section 
13.5 (see Figure 13.7). However, when the annul bit is set, the instruction following 
the branch is executed only if the branch is taken. The processor suppresses the 
effect of that instruction even though it is already in the pipeline. This annul bit is 
useful because it makes it easier for the compiler to fill the delay slot following a 
conditional branch. The instruction that is the target of the branch can always be 
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2 30 
Call format |Op PC-Relative displacement 
21 #4 3 22 


Branch —|Qp|al Cond | Op2 PC-Relative displacement 
format 


2 5 3 22 
oe Op| Dest | Op2 Immediate constant 
format 
Floating- : = ° 3 2 : 
point Op| Dest Op3 Sre-1 FP-op Sre-2 
format 
2 5 6 5 1 8 5 
Op| Dest Op3 Sre-1  |0 Ignored Sre-2 
General 
formats 
Op| Dest Op3 Sre-1  |1| Immediate constant 


Figure 13.14 SPARC Instruction Formats 


put in the delay slot, because if the branch is not taken, the instruction can be an- 
nulled. The reason this technique is desirable is that conditional branches are gener- 
ally taken more than half the time. 

The SETHI instruction is a special instruction used to load or store a 32-bit 
value. This feature is needed to load and store addresses and large constants. The 
SETHI instruction sets the 22 high-order bits of a register with its 22-bit immediate 
operand, and zeros out the low-order 10 bits. An immediate constant of up to 13 bits 
can be specified in one of the general formats, and such an instruction could be used 
to fill in the remaining 10 bits of the register. A load or store instruction can also be 
used to achieve a direct addressing mode. To load a value from location K in memory, 
we could use the following SPARC instructions: 


sethi %hi(K), %r8 sload high-order 22 bits of address of location 
;K into register r8 
Id [%r8 + %lo(K)], %r8 ;load contents of location K into r8 


The macros %hi and %lo are used to define immediate operands consisting of the 
appropriate address bits of a location. This use of SETHI is similar to the use of the 
lui instruction on the MIPS. 

The floating-point format is used for floating-point operations. Two source and 
one destination registers are designated. 
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Finally, all other operations, including loads, stores, arithmetic, and logical op- 
erations use one of the last two formats shown in Figure 13.14. One of the formats 
makes use of two source registers and a destination register, while the other uses 
one source register, one 13-bit immediate operand, and one destination register. 


13.8 RISC VERSUS CISC CONTROVERSY 


For many years, the general trend in computer architecture and organization has 
been toward increasing processor complexity: more instructions, more address- 
ing modes, more specialized registers, and so on. The RISC movement repre- 
sents a fundamental break with the philosophy behind that trend. Naturally, the 
appearance of RISC systems, and the publication of papers by its proponents ex- 
tolling RISC virtues, led to a reaction from those involved in the design of CISC 
architectures. 

The work that has been done on assessing merits of the RISC approach can be 
grouped into two categories: 


¢ Quantitative: Attempts to compare program size and execution speed of pro- 
grams on RISC and CISC machines that use comparable technology 


¢ Qualitative: Examins issues such as high-level language support and optimum 
use of VLSI real estate 


Most of the work on quantitative assessment has been done by those working 
on RISC systems [PATT82b, HEAT84, PATT84], and it has been, by and large, fa- 
vorable to the RISC approach. Others have examined the issue and come away un- 
convinced [COLW85a, FLYN87, DAVI87]. There are several problems with 
attempting such comparisons [SERL86]: 


e There is no pair of RISC and CISC machines that are comparable in life-cycle 
cost, level of technology, gate complexity, sophistication of compiler, operating 
system support, and so on. 


e No definitive test set of programs exists. Performance varies with the program. 


It is difficult to sort out hardware effects from effects due to skill in compiler 
writing. 

e Most of the comparative analysis on RISC has been done on “toy” machines 
rather than commercial products. Furthermore, most commercially available 
machines advertised as RISC possess a mixture of RISC and CISC character- 
istics. Thus, a fair comparison with a commercial, “pure-play” CISC machine 
(e.g., VAX, Pentium) is difficult. 


The qualitative assessment is, almost by definition, subjective. Several re- 
searchers have turned their attention to such an assessment [COLW85a, WALL85], 
but the results are, at best, ambiguous, and certainly subject to rebuttal [PATT85b] 
and, of course, counterrebuttal [COLW85b]. 

In more recent years, the RISC versus CISC controversy has died down to a 
great extent. This is because there has been a gradual convergence of the technologies. 
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As chip densities and raw hardware speeds increase, RISC systems have become 
more complex. At the same time, in an effort to squeeze out maximum performance, 
CISC designs have focused on issues traditionally associated with RISC, such as an 
increased number of general-purpose registers and increased emphasis on instruction 
pipeline design. 
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13.10 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


complex instruction set high-level language (HLL) register file 
computer (CISC) reduced instruction set register window 


delayed branch computer (RISC) SPARC 
delayed load 
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Review Questions 


13.1 
13.2 


13.3 


What are some typical distinguishing characteristics of RISC organization? 


Briefly explain the two basic approaches used to minimize register-memory opera- 
tions on RISC machines. 


If a circular register buffer is used to handle local variables for nested procedures, de- 
scribe two approaches for handling global variables. 


What are some typical characteristics of a RISC instruction set architecture? 
What is a delayed branch? 


Problems 


13.1 


Considering the call-return pattern in Figure 4.21, how many overflows and under- 
flows (each of which causes a register save/restore) will occur with a window size of 
a. 5? 

b. 8? 

c. 16? 

In the discussion of Figure 13.2, it was stated that only the first two portions of a win- 
dow are saved or restored. Why is it not necessary to save the temporary registers? 
We wish to determine the execution time for a given program using the various 
pipelining schemes discussed in Section 13.5. Let 


N = number of executed instructions 
D = number of memory accesses 
J = number of jump instructions 


For the simple sequential scheme (Figure 13.6a), the execution time is 2N + D 
stages. Derive formulas for two-stage, three-stage, and four-stage pipelining. 


Reorganize the code sequence in Figure 13.6d to reduce the number of NOOPs. 
Consider the following code fragment in a high-level language: 


for Lin 1...100 loop 
S<S + Q().VAL 
end loop; 


Assume that Q is an array of 32-byte records and the VAL field is in the first 4 bytes 
of each record. Using x86 code, we can compile this program fragment as follows: 


MOV ECX,1 suse register ECX to hold I 
LP: IMUL EAX, ECX, 32 sget offset in EAX 

MOV EBX, OQ[EAX] sload VAL field 

ADD S, EBX ;add to S 

INC ECX sincrement I 

CMP ECX, 101 :compare to 101 

JNE LP sloop until I = 100 


This program makes use of the IMUL instruction, which multiplies the second 
operand by the immediate value in the third operand and places the result in the first 
operand (see Problem 10.13). A RISC advocate would like to demonstrate that a 
clever compiler can eliminate unnecessarily complex instructions such as IMUL. Pro- 
vide the demonstration by rewriting the above x86 program without using the IMUL 
instruction. 


Consider the following loop: 
S:= 0; 
for K := 1 to 100 do 
S:=S-—K; 
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13.7 


13.9 


13.10 


13.11 


A straightforward translation of this into a generic assembly language would look 
something like this: 


LD R1,0 skeep value of S in R1 

LD R2,1 skeep value of K in R2 
LP SUB R1, R1, R2 S:=S-K 

BEQ R2, 100, EXIT ;done if K = 100 

ADD R2, R2,1 else increment K 

JMP LP sback to start of loop 


A compiler for a RISC machine will introduce delay slots into this code so that the 
processor can employ the delayed branch mechanism. The JMP instruction is easy to 
deal with, because this instruction is always followed by the SUB instruction; there- 
fore, we can simply place a copy of the SUB instruction in the delay slot after the JMP. 
The BEQ presents a difficulty. We can’t leave the code as is, because the ADD in- 
struction would then be executed one too many times. Therefore, a NOP instruction is 
needed. Show the resulting code. 

A RISC machine may do both a mapping of symbolic registers to actual registers and 
a rearrangement of instructions for pipeline efficiency. An interesting question arises 
as to the order in which these two operations should be done. Consider the following 
program fragment: 


LD SR1,A sload A into symbolic register 1 

LD SR2,B sload B into symbolic register 2 

ADD — SR3,SR1,SR2 — ;add contents of SR1 and SR2 and store in SR3 
LD SR4, C 

LD SR5,D 


ADD _ SR6,SR4,SR5 


a. First do the register mapping and then any possible instruction reordering. How 
many machine registers are used? Has there been any pipeline improvement? 

b. Starting with the original program, now do instruction reordering and then any 
possible mapping. How many machine registers are used? Has there been any 
pipeline improvement? 

Add entries for the following processors to Table 13.7: 

a. Pentium II 

b. ARM 


In many cases, common machine instructions that are not listed as part of the MIPS 
instruction set can be synthesized with a single MIPS instruction. Show this for the 
following: 

a. Register-to-register move 

b. Increment, decrement 

c. Complement 

d. Negate 

e. Clear 


A SPARC implementation has K register windows. What is the number N of physical 
registers? 

SPARC is lacking a number of instructions commonly found on CISC machines. 
Some of these are easily simulated using either register RO, which is always set to 0, or 
a constant operand. These simulated instructions are called pseudoinstructions and 
are recognized by the SPARC compiler. Show how to simulate the following 
pseudoinstructions, each with a single SPARC instruction. In all of these, sre and dst 
refer to registers. (Hint: A store to RO has no effect.) 

a. MOV src, dst d. NOT dst g. DEC dst 

b. COMPARE srcl, srce2 e. NEG dst h. CLR dst 

c. TEST srel f. INC dst i. NOP 
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13.12 Consider the following code fragment: 


if K > 10 
L:=K+1 

else 
L:=K—-1; 


A straightforward translation of this statement into SPARC assembler could take the 
following form: 


sethi 


Id 
cmp 
ble 
nop 
sethi 
ld 
inc 
sethi 
st 
b 
nop 
Ll: © sethi 
ld 
dec 
sethi 
st 
2: 


*hi(K), %r8 sload high-order 22 bits of address of location 
:K into register r8 
[%r8 + %lo(K)],%r8 — ;load contents of location K into r8 


%r8, 10 ;compare contents of r8 with 10 
Ll sbranch if (r8) = 10 


%hi(K), %r9 

[%r9 + %lo(K)],%r9 —_;load contents of location K into r9 
%r9 sadd 1 to (19) 

%hi(L), %r10 

%r9,[%r10 + %lo(L)] — ;store (19) into location L 

L2 


%hi(K), %rl1 

[%r11 + %lo(K)], %r12_ ;load contents of location K into r12 
%r12 ssubtract 1 from (112) 

%hi(L), %r13 

%r12,[%r13 + %lo(L)] ;store (112) into location L 


The code contains a nop after each branch instruction to permit delayed branch 


operation. 


a. Standard compiler optimizations that have nothing to do with RISC machines are 
generally effective in being able to perform two transformations on the foregoing 
code. Notice that two of the loads are unnecessary and that the two stores can be 
merged if the store is moved to a different place in the code. Show the program 
after making these two changes. 

b. It is now possible to perform some optimizations peculiar to SPARC. The nop 
after the ble can be replaced by moving another instruction into that delay slot 
and setting the annul bit on the ble instruction (expressed as ble,a L1). Show the 
program after this change. 

c. There are now two unnecessary instructions. Remove these and show the result- 
ing program. 
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KEY POINTS 


® A superscalar processor is one in which multiple independent instruction 
pipelines are used. Each pipeline consists of multiple stages, so that each 
pipeline can handle multiple instructions at a time. Multiple pipelines intro- 
duce a new level of parallelism, enabling multiple streams of instructions to 
be processed at a time. A superscalar processor exploits what is known 
as instruction-level parallelism, which refers to the degree to which the 
instructions of a program can be executed in parallel. 


® Asuperscalar processor typically fetches multiple instructions at a time and 
then attempts to find nearby instructions that are independent of one an- 
other and can therefore be executed in parallel. If the input to one instruction 
depends on the output of a preceding instruction, then the latter instruction 
cannot complete execution at the same time or before the former instruc- 
tion. Once such dependencies have been identified, the processor may issue 
and complete instructions in an order that differs from that of the original 
machine code. 


® The processor may eliminate some unnecessary dependencies by the use of 
additional registers and the renaming of register references in the original 
code. 


® Whereas pure RISC processors often employ delayed branches to maximize 
the utilization of the instruction pipeline, this method is less appropriate to 
a superscalar machine. Instead, most superscalar machines use traditional 
branch prediction methods to improve efficiency. 


A superscalar implementation of a processor architecture is one in which common 
instructions—integer and floating-point arithmetic, loads, stores, and conditional 
branches—can be initiated simultaneously and executed independently. Such imple- 
mentations raise a number of complex design issues related to the instruction pipeline. 

Superscalar design arrived on the scene hard on the heels of RISC architecture. 
Although the simplified instruction set architecture of a RISC machine lends itself 
readily to superscalar techniques, the superscalar approach can be used on either a 
RISC or CISC architecture. 

Whereas the gestation period for the arrival of commercial RISC machines from 
the beginning of true RISC research with the IBM 801 and the Berkeley RISC I was 
seven or eight years, the first superscalar machines became commercially available 
within just a year or two of the coining of the term superscalar. The superscalar ap- 
proach has now become the standard method for implementing high-performance 
microprocessors. 

In this chapter, we begin with an overview of the superscalar approach, contrast- 
ing it with superpipelining. Next, we present the key design issues associated with su- 
perscalar implementation. Then we look at several important examples of superscalar 
architecture. 
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14.1 OVERVIEW 


The term superscalar, first coined in 1987 [AGER87], refers to a machine that is de- 
signed to improve the performance of the execution of scalar instructions. In most 
applications, the bulk of the operations are on scalar quantities. Accordingly, the su- 
perscalar approach represents the next step in the evolution of high-performance 
general-purpose processors. 

The essence of the superscalar approach is the ability to execute instructions inde- 
pendently and concurrently in different pipelines. The concept can be further exploited 
by allowing instructions to be executed in an order different from the program order. 
Figure 14.1 shows, in general terms, the superscalar approach. There are multiple func- 
tional units, each of which is implemented as a pipeline, which support parallel execu- 
tion of several instructions. In this example, two integer, two floating-point, and one 
memory (either load or store) operations can be executing at the same time. 

Many researchers have investigated superscalar-like processors, and their re- 
search indicates that some degree of performance improvement is possible. Table 
14.1 presents the reported performance advantages. The differences in the results 
arise from differences both in the hardware of the simulated machine and in the ap- 
plications being simulated. 


Integer register file Floating-point register file 


YN 


Pipelined 
functional 
units 


Memory 


Figure 14.1 General Superscalar Organization 
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Superscalar versus Superpipelined 


An alternative approach to achieving greater performance is referred to as super- 
pipelining, a term first coined in 1988 [JOUP88]. Superpipelining exploits the fact 
that many pipeline stages perform tasks that require less than half a clock cycle. Thus, 
a doubled internal clock speed allows the performance of two tasks in one external 
clock cycle. We have seen one example of this approach with the MIPS R4000. 
Figure 14.2 compares the two approaches. The upper part of the diagram illus- 
trates an ordinary pipeline, used as a base for comparison. The base pipeline issues 
one instruction per clock cycle and can perform one pipeline stage per clock cycle. 
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Figure 14.2 Comparison of Superscalar and Superpipeline Approaches 
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The pipeline has four stages: instruction fetch, operation decode, operation execu- 
tion, and result write back. The execution stage is crosshatched for clarity. Note that 
although several instructions are executing concurrently, only one instruction is in 
its execution stage at any one time. 

The next part of the diagram shows a superpipelined implementation that is 
capable of performing two pipeline stages per clock cycle. An alternative way of 
looking at this is that the functions performed in each stage can be split into two 
nonoverlapping parts and each can execute in half a clock cycle. A superpipeline im- 
plementation that behaves in this fashion is said to be of degree 2. Finally, the lowest 
part of the diagram shows a superscalar implementation capable of executing two 
instances of each stage in parallel. Higher-degree superpipeline and superscalar im- 
plementations are of course possible. 

Both the superpipeline and the superscalar implementations depicted in 
Figure 14.2 have the same number of instructions executing at the same time in the 
steady state. The superpipelined processor falls behind the superscalar processor at 
the start of the program and at each branch target. 


Limitations 


The superscalar approach depends on the ability to execute multiple instructions in par- 
allel. The term instruction-level parallelism refers to the degree to which, on average, the 
instructions of a program can be executed in parallel. A combination of compiler-based 
optimization and hardware techniques can be used to maximize instruction-level paral- 
lelism. Before examining the design techniques used in superscalar machines to increase 
instruction-level parallelism, we need to look at the fundamental limitations to paral- 
lelism with which the system must cope. [JOHN91] lists five limitations: 


e True data dependency 
e Procedural dependency 
e Resource conflicts 

e Output dependency 

e Antidependency 


We examine the first three of these limitations in the remainder of this section. A 
discussion of the last two must await some of the developments in the next section. 


TRUE DATA DEPENDENCY Consider the following sequence:' 


ADD EAX, ECX ;load register EAX with the con- 
tents of ECX plus the contents 
of EAX 

MOV EBX, EAX ;load EBX with the contents of EAX 


The second instruction can be fetched and decoded but cannot execute until the 
first instruction executes. The reason is that the second instruction needs data 


‘For the Intel x86 assembly language, a semicolon starts a comment field. 
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Figure 14.3. Effect of Dependencies 


produced by the first instruction. This situation is referred to as a true data depen- 
dency (also called flow dependency or write after read [WAR] dependency). 

Figure 14.3 illustrates this dependency in a superscalar machine of degree 2. With 
no dependency, two instructions can be fetched and executed in parallel. If there is a 
data dependency between the first and second instructions, then the second instruction 
is delayed as many clock cycles as required to remove the dependency. In general, any 
instruction must be delayed until all of its input values have been produced. 

In a simple pipeline, such as illustrated in the upper part of Figure 14.2, the 
aforementioned sequence of instructions would cause no delay. However, consider 
the following, in which one of the loads is from memory rather than from a register: 


MOV EAX, eff ;load register EAX with the con- 
tents of effective memory ad- 
dress eff 

BX, EAX ;load EBX with the contents of EAX 


MOV 


Ez 
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A typical RISC processor takes two or more cycles to perform a load from 
memory when the load is a cache hit. It can take tens or even hundreds of cycles for 
a cache miss on all cache levels, because of the delay of an off-chip memory access. 
One way to compensate for this delay is for the compiler to reorder instructions so 
that one or more subsequent instructions that do not depend on the memory load 
can begin flowing through the pipeline. This scheme is less effective in the case of a 
superscalar pipeline: The independent instructions executed during the load are 
likely to be executed on the first cycle of the load, leaving the processor with noth- 
ing to do until the load completes. 


PROCEDURAL DEPENDENCIES As was discussed in Chapter 12, the presence of 
branches in an instruction sequence complicates the pipeline operation. The instruc- 
tions following a branch (taken or not taken) have a procedural dependency on the 
branch and cannot be executed until the branch is executed. Figure 14.3 illustrates 
the effect of a branch on a superscalar pipeline of degree 2. 

As we have seen, this type of procedural dependency also affects a scalar 
pipeline. The consequence for a superscalar pipeline is more severe, because a 
greater magnitude of opportunity is lost with each delay. 

If variable-length instructions are used, then another sort of procedural de- 
pendency arises. Because the length of any particular instruction is not known, it 
must be at least partially decoded before the following instruction can be fetched. 
This prevents the simultaneous fetching required in a superscalar pipeline. This is 
one of the reasons that superscalar techniques are more readily applicable to a 
RISC or RISC-like architecture, with its fixed instruction length. 


RESOURCE CONFLICT A resource conflict is a competition of two or more instruc- 
tions for the same resource at the same time. Examples of resources include memo- 
ries, caches, buses, register-file ports, and functional units (e.g., ALU adder). 

In terms of the pipeline, a resource conflict exhibits similar behavior to a data 
dependency (Figure 14.3). There are some differences, however. For one thing, re- 
source conflicts can be overcome by duplication of resources, whereas a true data 
dependency cannot be eliminated. Also, when an operation takes a long time to com- 
plete, resource conflicts can be minimized by pipelining the appropriate functional unit. 


14.2 DESIGN ISSUES 


Instruction-Level Parallelism and Machine Parallelism 


[JOUP89a] makes an important distinction between the two related concepts of 
instruction-level parallelism and machine parallelism. Instruction-level parallelism 
exists when instructions in a sequence are independent and thus can be executed in 
parallel by overlapping. 

As an example of the concept of instruction-level parallelism, consider the fol- 
lowing two code fragments [JOUP89b]: 


Load Rl ¢< R2 Add R3 < R3, “1” 
Add R3 < R3, “1” Add R4 ¢ R3, R2 
Add R4 < R4, R2 Store [R4] < RO 


14.2 / DESIGN ISSUES 529 


The three instructions on the left are independent, and in theory all three could be 
executed in parallel. In contrast, the three instructions on the right cannot be exe- 
cuted in parallel because the second instruction uses the result of the first, and the 
third instruction uses the result of the second. 

The degree of instruction-level parallelism is determined by the frequency of 
true data dependencies and procedural dependencies in the code. These factors, in 
turn, are dependent on the instruction set architecture and on the application. 
Instruction-level parallelism is also determined by what [JOUP89a] refers to as op- 
eration latency: the time until the result of an instruction is available for use as an 
operand in a subsequent instruction. The latency determines how much of a delay a 
data or procedural dependency will cause. 

Machine parallelism is a measure of the ability of the processor to take advan- 
tage of instruction-level parallelism. Machine parallelism is determined by the num- 
ber of instructions that can be fetched and executed at the same time (the number 
of parallel pipelines) and by the speed and sophistication of the mechanisms that 
the processor uses to find independent instructions. 

Both instruction-level and machine parallelism are important factors in en- 
hancing performance. A program may not have enough instruction-level parallelism 
to take full advantage of machine parallelism. The use of a fixed-length instruction 
set architecture, as in a RISC, enhances instruction-level parallelism. On the other 
hand, limited machine parallelism will limit performance no matter what the nature 
of the program. 


Instruction Issue Policy 


As was mentioned, machine parallelism is not simply a matter of having multiple in- 
stances of each pipeline stage. The processor must also be able to identify instruction- 
level parallelism and orchestrate the fetching, decoding, and execution of instructions 
in parallel. [JOHN91] uses the term instruction issue to refer to the process of initi- 
ating instruction execution in the processor’s functional units and the term 
instruction issue policy to refer to the protocol used to issue instructions. In general, 
we can say that instruction issue occurs when instruction moves from the decode 
stage of the pipeline to the first execute stage of the pipeline. 

In essence, the processor is trying to look ahead of the current point of execu- 
tion to locate instructions that can be brought into the pipeline and executed. Three 
types of orderings are important in this regard: 


e The order in which instructions are fetched 
e The order in which instructions are executed 


e The order in which instructions update the contents of register and memory 
locations 


The more sophisticated the processor, the less it is bound by a strict relation- 
ship between these orderings. To optimize utilization of the various pipeline ele- 
ments, the processor will need to alter one or more of these orderings with respect 
to the ordering to be found in a strict sequential execution. The one constraint on 
the processor is that the result must be correct. Thus, the processor must accommo- 
date the various dependencies and conflicts discussed earlier. 
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In general terms, we can group superscalar instruction issue policies into the 
following categories: 


e In-order issue with in-order completion 
e In-order issue with out-of-order completion 
e Out-of-order issue with out-of-order completion 


IN-ORDER ISSUE WITH IN-ORDER COMPLETION The simplest instruction issue 
policy is to issue instructions in the exact order that would be achieved by sequential 
execution (in-order issue) and to write results in that same order (in-order comple- 
tion). Not even scalar pipelines follow such a simple-minded policy. However, it is use- 
ful to consider this policy as a baseline for comparing more sophisticated approaches. 

Figure 14.4a gives an example of this policy. We assume a superscalar pipeline 
capable of fetching and decoding two instructions at a time, having three separate 
functional units (e.g., two integer arithmetic and one floating-point arithmetic), and 
having two instances of the write-back pipeline stage. The example assumes the fol- 
lowing constraints on a six-instruction code fragment: 


e I1 requires two cycles to execute. 

e 13 and 14 conflict for the same functional unit. 
e I5 depends on the value produced by I4. 

e I5 and 16 conflict for a functional unit. 


Instructions are fetched two at a time and passed to the decode unit. Because 
instructions are fetched in pairs, the next two instructions must wait until the pair of 
decode pipeline stages has cleared. To guarantee in-order completion, when there is 
a conflict for a functional unit or when a functional unit requires more than one 
cycle to generate a result, the issuing of instructions temporarily stalls. 

In this example, the elapsed time from decoding the first instruction to writing 
the last results is eight cycles. 


IN-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION Out-of-order completion 
is used in scalar RISC processors to improve the performance of instructions that 
require multiple cycles. Figure 14.4b illustrates its use on a superscalar processor. In- 
struction [2 is allowed to run to completion prior to I1. This allows I3 to be completed 
earlier, with the net result of a savings of one cycle. 

With out-of-order completion, any number of instructions may be in the exe- 
cution stage at any one time, up to the maximum degree of machine parallelism 
across all functional units. Instruction issuing is stalled by a resource conflict, a data 
dependency, or a procedural dependency. 

In addition to the aforementioned limitations, a new dependency, which we re- 
ferred to earlier as an output dependency (also called write after write (WAW) depen- 
dency), arises. The following code fragment illustrates this dependency (op represents 
any operation): 


I1: R3 © R3 op R5 
I2: R4 ¢& RB +1 
I3: R38 ¢ R5+1 
I4: R7 © R3 op R4 
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Decode Execute Write Cycle 
Il 12 1 
13 14 Il 12 2 
13 14 Il 3 
14 13 Il | 2 4 
15 16 14 5 
16 15 I3 | 14 6 
16 7 
I5 | 16 8 
(a) In-order issue and in-order completion 
Decode Execute Write Cycle 
Il 12 1 
13 14 Il 12 2 
14 Il 13 12 3 
15 16 14 1 | 13 4 
16 15 14 5 
16 15 6 
16 7 
(b) In-order issue and out-of-order completion 
Decode Window Execute Write 
Il 12 
13 14 TH, 12 Il 12 
15 16 13, 14 Il 13 12 
14, 15, 16 16 14 ll | 3 
15 15 I4 | 16 
15 


(c) Out-of-order issue and out-of-order completion 


Figure 14.4 Superscalar Instruction Issue and Completion Policies 


Instruction I2 cannot execute before instruction I1, because it needs the result 


in register R3 produced in I1; this is an example of a true data dependency, as de- 
scribed in Section 14.1. Similarly, 14 must wait for I3, because it uses a result pro- 
duced by I3. What about the relationship between I1 and I3? There is no data 
dependency here, as we have defined it. However, if I3 executes to completion prior 
to I1, then the wrong value of the contents of R3 will be fetched for the execution of 
14. Consequently, I3 must complete after I1 to produce the correct output values. To 
ensure this, the issuing of the third instruction must be stalled if its result might later 


be overwritten by an older instruction that takes longer to complete. 
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Out-of-order completion requires more complex instruction issue logic than 
in-order completion. In addition, it is more difficult to deal with instruction inter- 
rupts and exceptions. When an interrupt occurs, instruction execution at the current 
point is suspended, to be resumed later. The processor must assure that the resump- 
tion takes into account that, at the time of interruption, instructions ahead of the in- 
struction that caused the interrupt may already have completed. 


OUT-OF-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION With in-order 
issue, the processor will only decode instructions up to the point of a dependency or 
conflict. No additional instructions are decoded until the conflict is resolved. As a 
result, the processor cannot look ahead of the point of conflict to subsequent in- 
structions that may be independent of those already in the pipeline and that may be 
usefully introduced into the pipeline. 

To allow out-of-order issue, it is necessary to decouple the decode and execute 
stages of the pipeline. This is done with a buffer referred to as an instruction window. 
With this organization, after a processor has finished decoding an instruction, it is placed 
in the instruction window. As long as this buffer is not full, the processor can continue to 
fetch and decode new instructions. When a functional unit becomes available in the ex- 
ecute stage, an instruction from the instruction window may be issued to the execute 
stage. Any instruction may be issued, provided that (1) it needs the particular functional 
unit that is available, and (2) no conflicts or dependencies block this instruction. 

The result of this organization is that the processor has a lookahead capability, 
allowing it to identify independent instructions that can be brought into the execute 
stage. Instructions are issued from the instruction window with little regard for their 
original program order. As before, the only constraint is that the program execution 
behaves correctly. 

Figures 14.4c illustrates this policy. During each of the first three cycles, two in- 
structions are fetched into the decode stage. During each cycle, subject to the constraint 
of the buffer size, two instructions move from the decode stage to the instruction win- 
dow. In this example, it is possible to issue instruction 16 ahead of I5 (recall that I5 de- 
pends on 14, but I6 does not). Thus, one cycle is saved in both the execute and 
write-back stages, and the end-to-end savings, compared with Figure 14.4b, is one cycle. 

The instruction window is depicted in Figure 14.4c to illustrate its role. How- 
ever, this window is not an additional pipeline stage. An instruction being in the win- 
dow simply implies that the processor has sufficient information about that instruction 
to decide when it can be issued. 

The out-of-order issue, out-of-order completion policy is subject to the same con- 
straints described earlier. An instruction cannot be issued if it violates a dependency or 
conflict. The difference is that more instructions are available for issuing, reducing the 
probability that a pipeline stage will have to stall. In addition, a new dependency, which 
we referred to earlier as an antidependency (also called read after write (RAW) de- 
pendency), arises. The code fragment considered earlier illustrates this dependency: 


I1: R3 © R3 op R5 
I2: R4 ¢ RB +1 
I3: RB ¢ R5+1 
I4: R7 © R3 op R4 
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Instruction I3 cannot complete execution before instruction I2 begins execu- 
tion and has fetched its operands. This is so because I3 updates register R3, which is 
a source operand for I2. The term antidependency is used because the constraint is 
similar to that of a true data dependency, but reversed: Instead of the first instruc- 
tion producing a value that the second instruction uses, the second instruction de- 
stroys a value that the first instruction uses. 


Reorder Buffer Simulator 
Tomasulo’s Algorithm Simulator 
Alternative Simulation of Tomasulo’s Algorithm 


One common technique that is used to support out-of-order completion is the 
reorder buffer. The reorder buffer is temporary storage for results completed out of 
order that are then committed to the register file in program order. A related con- 
cept is Tomasulo’s algorithm. Appendix I examines these concepts. 


Register Renaming 


When out-of-order instruction issuing and/or out-of-order instruction completion 
are allowed, we have seen that this gives rise to the possibility of WAW dependen- 
cies and WAR dependencies. These dependencies differ from RAW data dependen- 
cies and resource conflicts, which reflect the flow of data through a program and the 
sequence of execution. WAW dependencies and WAR dependencies, on the other 
hand, arise because the values in registers may no longer reflect the sequence of val- 
ues dictated by the program flow. 

When instructions are issued in sequence and complete in sequence, it is possi- 
ble to specify the contents of each register at each point in the execution. When out- 
of-order techniques are used, the values in registers cannot be fully known at each 
point in time just from a consideration of the sequence of instructions dictated by 
the program. In effect, values are in conflict for the use of registers, and the proces- 
sor must resolve those conflicts by occasionally stalling a pipeline stage. 

Antidependencies and output dependencies are both examples of storage con- 
flicts. Multiple instructions are competing for the use of the same register locations, 
generating pipeline constraints that retard performance. The problem is made more 
acute when register optimization techniques are used (as discussed in Chapter 13), 
because these compiler techniques attempt to maximize the use of registers, hence 
maximizing the number of storage conflicts. 

One method for coping with these types of storage conflicts is based on a 
traditional resource-conflict solution: duplication of resources. In this context, 
the technique is referred to as register renaming. In essence, registers are allocat- 
ed dynamically by the processor hardware, and they are associated with the val- 
ues needed by instructions at various points in time. When a new register value is 
created (i.e., when an instruction executes that has a register as a destination 
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operand), a new register is allocated for that value. Subsequent instructions that 
access that value as a source operand in that register must go through a renaming 
process: the register references in those instructions must be revised to refer to the 
register containing the needed value. Thus, the same original register reference in 
several different instructions may refer to different actual registers, if different 
values are intended. 

Let us consider how register renaming could be used on the code fragment we 
have been examining: 


I1: R3, < R3, op R5, 
T2: R4n <— R3, + 1 
T3: R3, — R5, +1 
T4: R7p — R3, op R4p 


The register reference without the subscript refers to the logical register refer- 
ence found in the instruction. The register reference with the subscript refers to a 
hardware register allocated to hold a new value. When a new allocation is made for 
a particular logical register, subsequent instruction references to that logical register 
as a source operand are made to refer to the most recently allocated hardware reg- 
ister (recent in terms of the program sequence of instructions). 

In this example, the creation of register R3, in instruction I3 avoids the RAW 
dependency on the second instruction and the output dependency on the first in- 
struction, and it does not interfere with the correct value being accessed by I4. The 
result is that I3 can be issued immediately; without renaming, I3 cannot be issued 
until the first instruction is complete and the second instruction is issued. 


Scoreboarding Simulator 


An alternative to register renaming is a scoreboarding. In essence, scoreboard- 
ing is a bookkeeping technique that allows instructions to execute whenever they 
are not dependent on previous instructions and no structural hazards are present. 
See Appendix I for a discussion. 


Machine Parallelism 


In the preceding, we have looked at three hardware techniques that can be used 
in a superscalar processor to enhance performance: duplication of resources, out- 
of-order issue, and renaming. One study that illuminates the relationship among 
these techniques was reported in [SMIT89]. The study made use of a simulation 
that modeled a machine with the characteristics of the MIPS R2000, augmented 
with various superscalar features. A number of different program sequences were 
simulated. 
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Figure 14.5 Speedups of Various Machine Organizations without Procedural Dependencies 


Figure 14.5 shows the results. In each of the graphs, the vertical axis corre- 
sponds to the mean speedup of the superscalar machine over the scalar machine. The 
horizontal axis shows the results for four alternative processor organizations. 
The base machine does not duplicate any of the functional units, but it can issue in- 
structions out of order. The second configuration duplicates the load/store functional 
unit that accesses a data cache. The third configuration duplicates the ALU, and the 
fourth configuration duplicates both load/store and ALU. In each graph, results are 
shown for instruction window sizes of 8, 16, and 32 instructions, which dictates 
the amount of lookahead the processor can do. The difference between the two 
graphs is that, in the second, register renaming is allowed. This is equivalent to saying 
that the first graph reflects a machine that is limited by all dependencies, whereas the 
second graph corresponds to a machine that is limited only by true dependencies. 

The two graphs, combined, yield some important conclusions. The first is that 
it is probably not worthwhile to add functional units without register renaming. 
There is some slight improvement in performance, but at the cost of increased 
hardware complexity. With register renaming, which eliminates antidependencies 
and output dependencies, noticeable gains are achieved by adding more functional 
units. Note, however, that there is a significant difference in the amount of gain 
achievable between using an instruction window of 8 versus a larger instruction 
window. This indicates that if the instruction window is too small, data dependen- 
cies will prevent effective utilization of the extra functional units; the processor 
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must be able to look quite far ahead to find independent instructions to utilize the 
hardware more fully. 


Pipeline with Static Vs. Dynamic Scheduling — Simulator 


Branch Prediction 


Any high-performance pipelined machine must address the issue of dealing with 
branches. For example, the Intel 80486 addressed the problem by fetching both the 
next sequential instruction after a branch and speculatively fetching the branch tar- 
get instruction. However, because there are two pipeline stages between prefetch 
and execution, this strategy incurs a two-cycle delay when the branch gets taken. 

With the advent of RISC machines, the delayed branch strategy was ex- 
plored. This allows the processor to calculate the result of conditional branch in- 
structions before any unusable instructions have been prefetched. With this 
method, the processor always executes the single instruction that immediately 
follows the branch. This keeps the pipeline full while the processor fetches a new 
instruction stream. 

With the development of superscalar machines, the delayed branch strategy 
has less appeal. The reason is that multiple instructions need to execute in the delay 
slot, raising several problems relating to instruction dependencies. Thus, superscalar 
machines have returned to pre-RISC techniques of branch prediction. Some, like 
the PowerPC 601, use a simple static branch prediction technique. More sophisticat- 
ed processors, such as the PowerPC 620 and the Pentium 4, use dynamic branch pre- 
diction based on branch history analysis. 


Superscalar Execution 


We are now in a position to provide an overview of superscalar execution of pro- 
grams; this is illustrated in Figure 14.6. The program to be executed consists of a lin- 
ear sequence of instructions. This is the static program as written by the programmer 
or generated by the compiler. The instruction fetch process, which includes branch 
prediction, is used to form a dynamic stream of instructions. This stream is examined 
for dependencies, and the processor may remove artificial dependencies. The proces- 
sor then dispatches the instructions into a window of execution. In this window, 
instructions no longer form a sequential stream but are structured according to their 
true data dependencies. The processor performs the execution stage of each instruc- 
tion in an order determined by the true data dependencies and hardware resource 
availability. Finally, instructions are conceptually put back into sequential order and 
their results are recorded. 

The final step mentioned in the preceding paragraph is referred to as 
committing, or retiring, the instruction. This step is needed for the following reason. 
Because of the use of parallel, multiple pipelines, instructions may complete in an 
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Figure 14.6 Conceptual Depiction of Superscalar Processing 


order different from that shown in the static program. Further, the use of branch 
prediction and speculative execution means that some instructions may complete 
execution and then must be abandoned because the branch they represent is not 
taken. Therefore, permanent storage and program-visible registers cannot be updat- 
ed immediately when instructions complete execution. Results must be held in some 
sort of temporary storage that is usable by dependent instructions and then made 
permanent when it is determined that the sequential model would have executed 
the instruction. 


Superscalar Implementation 


Based on our discussion so far, we can make some general comments about the 
processor hardware required for the superscalar approach. [SMIT95] lists the fol- 
lowing key elements: 


e Instruction fetch strategies that simultaneously fetch multiple instructions, 
often by predicting the outcomes of, and fetching beyond, conditional branch 
instructions. These functions require the use of multiple pipeline fetch and 
decode stages, and branch prediction logic. 


e Logic for determining true dependencies involving register values, and mech- 
anisms for communicating these values to where they are needed during 
execution. 

e Mechanisms for initiating, or issuing, multiple instructions in parallel. 


e Resources for parallel execution of multiple instructions, including multiple 
pipelined functional units and memory hierarchies capable of simultaneously 
servicing multiple memory references. 


e Mechanisms for committing the process state in correct order. 
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14.3 PENTIUM 4 


Although the concept of superscalar design is generally associated with the RISC ar- 
chitecture, the same superscalar principles can be applied to a CISC machine. Perhaps 
the most notable example of this is the Pentium. The evolution of superscalar concepts 
in the Intel line is interesting to note. The 386 is a traditional CISC nonpipelined ma- 
chine. The 486 introduced the first pipelined x86 processor, reducing the average latency 
of integer operations from between two and four cycles to one cycle, but still limited to 
executing a single instruction each cycle, with no superscalar elements. The original 
Pentium had a modest superscalar component, consisting of the use of two separate in- 
teger execution units. The Pentium Pro introduced a full-blown superscalar design. 
Subsequent Pentium models have refined and enhanced the superscalar design. 

A general block diagram of the Pentium 4 was shown in Figure 4.18. Figure 
14.7 depicts the same structure in a way more suitable for the pipeline discussion in 
this section. The operation of the Pentium 4 can be summarized as follows: 


1. The processor fetches instructions from memory in the order of the static 
program. 

2. Each instruction is translated into one or more fixed-length RISC instructions, 
known as micro-operations, or micro-ops. 
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Figure 14.7. Pentium 4 Block Diagram 
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Figure 14.8 Pentium 4 Pipeline 


3. The processor executes the micro-ops on a superscalar pipeline organization, so 
that the micro-ops may be executed out of order. 


4. The processor commits the results of each micro-op execution to the proces- 
sor’s register set in the order of the original program flow. 


In effect, the Pentium 4 architecture consists of an outer CISC shell with an 
inner RISC core. The inner RISC micro-ops pass through a pipeline with at least 
20 stages (Figure 14.8); in some cases, the micro-op requires multiple execution 
stages, resulting in an even longer pipeline. This contrasts with the five-stage 
pipeline (Figure 12.21) used on the earlier Intel x86 processors and on the Pentium. 

We now trace the operation of the Pentium 4 pipeline, using Figure 14.9 to 
illustrate its operation. 


Front End 


GENERATION OF MICRO-OPS The Pentium 4 organization includes an in-order 
front end (Figure 14.9a) that can be considered outside the scope of the pipeline de- 
picted in Figure 14.8. This front end feeds into an L1 instruction cache, called the 
trace cache, which is where the pipeline proper begins. Usually, the processor oper- 
ates from the trace cache; when a trace cache miss occurs, the in-order front end 
feeds new instructions into the trace cache. 

With the aid of the branch target buffer and the instruction lookaside buffer 
(BTB & I-TLB), the fetch/decode unit fetches Pentium 4 machine instructions from 
the L2 cache 64 bytes at a time. As a default, instructions are fetched sequentially, so 
that each L2 cache line fetch includes the next instruction to be fetched. Branch pre- 
diction via the BTB & I-TLB unit may alter this sequential fetch operation. The 
ITLB translates the linear instruction pointer address given it into physical address- 
es needed to access the L2 cache. Static branch prediction in the front-end BTB is 
used to determine which instructions to fetch next. 

Once instructions are fetched, the fetch/decode unit scans the bytes to deter- 
mine instruction boundaries; this is a necessary operation because of the variable 
length of x86 instructions. The decoder translates each machine instruction into 
from one to four micro-ops, each of which is a 118-bit RISC instruction. Note for 
comparison that most pure RISC machines have an instruction length of just 32 bits. 
The longer micro-op length is required to accommodate the more complex Pentium 
operations. Nevertheless, the micro-ops are easier to manage than the original in- 
structions from which they derive. 

The generated micro-ops are stored in the trace cache. 
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Figure 14.9 Pentium Pipeline Operation 


TRACE CACHE NEXT INSTRUCTION POINTER The first two pipeline stages 
(Figure 14.9b) deal with the selection of instructions in the trace cache and in- 
volve a separate branch prediction mechanism from that described in the previ- 
ous section. The Pentium 4 uses a dynamic branch prediction strategy based on 
the history of recent executions of branch instructions. A branch target buffer 
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Figure 14.9 Continued 


(BTB) is maintained that caches information about recently encountered branch 
instructions. Whenever a branch instruction is encountered in the instruction 
stream, the BTB is checked. If an entry already exists in the BTB, then the in- 
struction unit is guided by the history information for that entry in determining 
whether to predict that the branch is taken. If a branch is predicted, then the 
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branch destination address associated with this entry is used for prefetching the 
branch target instruction. 

Once the instruction is executed, the history portion of the appropriate entry 
is updated to reflect the result of the branch instruction. If this instruction is not rep- 
resented in the BTB, then the address of this instruction is loaded into an entry in 
the BTB; if necessary, an older entry is deleted. 

The description of the preceding two paragraphs fits, in general terms, the 
branch prediction strategy used on the original Pentium model, as well as the later 
Pentium models, including Pentium 4. However, in the case of the Pentium, a rela- 
tively simple 2-bit history scheme is used. The later Pentium models have much 
longer pipelines (20 stages for the Pentium 4 compared with 5 stages for the Pen- 
tium) and therefore the penalty for misprediction is greater. Accordingly, the later 
Pentium models use a more elaborate branch prediction scheme with more history 
bits to reduce the misprediction rate. 

The Pentium 4 BTB is organized as a four-way set-associative cache with 
512 lines. Each entry uses the address of the branch as a tag. The entry also in- 
cludes the branch destination address for the last time this branch was taken and 
a 4-bit history field. Thus use of four history bits contrasts with the 2 bits used in 
the original Pentium and used in most superscalar processors. With 4 bits, the 
Pentium 4 mechanism can take into account a longer history in predicting 
branches. The algorithm that is used is referred to as Yeh’s algorithm [YEH91]. 
The developers of this algorithm have demonstrated that it provides a significant 
reduction in misprediction compared to algorithms that use only 2 bits of history 
[EVER98]. 

Conditional branches that do not have a history in the BTB are predicted 
using a static prediction algorithm, according to the following rules: 


e For branch addresses that are not IP relative, predict taken if the branch is a 
return and not taken otherwise. 


e For IP-relative backward conditional branches, predict taken. This rule reflects 
the typical behavior of loops. 


e For IP-relative forward conditional branches, predict not taken. 


TRACE CACHE FETCH The trace cache (Figure 14.9c) takes the already-decoded 
micro-ops from the instruction decoder and assembles them in to program-ordered 
sequences of micro-ops called traces. Micro-ops are fetched sequentially from the 
trace cache, subject to the branch prediction logic. 

A few instructions require more than four micro-ops. These instructions are 
transferred to microcode ROM, which contains the series of micro-ops (five or more) 
associated with a complex machine instruction. For example, a string instruction may 
translate into a very large (even hundreds), repetitive sequence of micro-ops. Thus, 
the microcode ROM is a microprogrammed control unit in the sense discussed in 
Part Four. After the microcode ROM finishes sequencing micro-ops for the current 
Pentium instruction, fetching resumes from the trace cache. 


DRIVE The fifth stage (Figure 14.9d) of the Pentium 4 pipeline delivers decoded in- 
structions from the trace cache to the rename/allocator module. 
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Out-of-Order Execution Logic 


This part of the processor reorders micro-ops to allow them to execute as quickly as 
their input operands are ready. 


ALLOCATE The allocate stage (Figure 14.9e) allocates resources required for exe- 
cution. It performs the following functions: 


e If a needed resource, such as a register, is unavailable for one of the three 
micro-ops arriving at the allocator during a clock cycle, the allocator stalls 
the pipeline. 

e The allocator allocates a reorder buffer (ROB) entry, which tracks the 
completion status of one of the 126 micro-ops that could be in process at 
any time.” 

e The allocator allocates one of the 128 integer or floating-point register entries 
for the result data value of the micro-op, and possibly a load or store buffer 
used to track one of the 48 loads or 24 stores in the machine pipeline. 


e The allocator allocates an entry in one of the two micro-op queues in front of 
the instruction schedulers. 


The ROB is a circular buffer that can hold up to 126 micro-ops and also con- 
tains the 128 hardware registers. Each buffer entry consists of the following fields: 


e State: Indicates whether this micro-op is scheduled for execution, has been 
dispatched for execution, or has completed execution and is ready for 
retirement. 


e Memory Address: The address of the Pentium instruction that generated the 
micro-op. 
e Micro-op: The actual operation. 


e Alias Register: If the micro-op references one of the 16 architectural registers, 
this entry redirects that reference to one of the 128 hardware registers. 


Micro-ops enter the ROB in order. Micro-ops are then dispatched from the 
ROB to the Dispatch/Execute unit out of order. The criterion for dispatch is that the 
appropriate execution unit and all necessary data items required for this micro-op 
are available. Finally, micro-ops are retired from the ROB in order. To accomplish 
in-order retirement, micro-ops are retired oldest first after each micro-op has been 
designated as ready for retirement. 


REGISTER RENAMING The rename stage (Figure 14.9e) remaps references to the 
16 architectural registers (8 floating-point registers, plus EAX, EBX, ECX, EDX, 
ESI, EDI, EBP, and ESP) into a set of 128 physical registers. The stage removes false 
dependencies caused by a limited number of architectural registers while preserving 
the true data dependencies (reads after writes). 


See Appendix I for a discussion of reorder buffers. 
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MICRO-OP QUEUING After resource allocation and register renaming, micro-ops 
are placed in one of two micro-op queues (Figure 14.9f), where they are held until 
there is room in the schedulers. One of the two queues is for memory operations 
(loads and stores) and the other for micro-ops that do not involve memory refer- 
ences. Each queue obeys a FIFO (first-in-first-out) discipline, but no order is main- 
tained between queues. That is, a micro-op may be read out of one queue out of 
order with respect to micro-ops in the other queue. This provides greater flexibility 
to the schedulers. 


MICRO-OP SCHEDULING AND DISPATCHING The schedulers (Figure 14.9g) are 
responsible for retrieving micro-ops from the micro-op queues and dispatching 
these for execution. Each scheduler looks for micro-ops in whose status indicates 
that the micro-op has all of its operands. If the execution unit needed by that 
micro-op is available, then the scheduler fetches the micro-op and dispatches it to 
the appropriate execution unit (Figure 14.9h). Up to six micro-ops can be dis- 
patched in one cycle. If more than one micro-op is available for a given execution 
unit, then the scheduler dispatches them in sequence from the queue. This is a sort 
of FIFO discipline that favors in-order execution, but by this time the instruction 
stream has been so rearranged by dependencies and branches that it is substan- 
tially out of order. 

Four ports attach the schedulers to the execution units. Port 0 is used for both 
integer and floating-point instructions, with the exception of simple integer opera- 
tions and the handling of branch mispredictions, which are allocated to Port 1. In ad- 
dition, MMX execution units are allocated between these two ports. The remaining 
ports are for memory loads and stores. 


Integer and Floating-Point Execution Units 


The integer and floating-point register files are the source for pending operations by 
the execution units (Figure 14.9i). The execution units retrieve values from the reg- 
ister files as well as from the L1 data cache (Figure 14.9j). A separate pipeline stage 
is used to compute flags (e.g., zero, negative); these are typically the input to a 
branch instruction. 

A subsequent pipeline stage performs branch checking (Figure 14.9k). This 
function compares the actual branch result with the prediction. If a branch predic- 
tion turns out to have been wrong, then there are micro-operations in various stages 
of processing that must be removed from the pipeline. The proper branch destina- 
tion is then provided to the Branch Predictor during a drive stage (Figure 14.91), 
which restarts the whole pipeline from the new target address. 


14.4 ARM CORTEX-A8 


Recent implementations of the ARM architecture have seen the introduction of su- 
perscalar techniques in the instruction pipeline. In this section, we focus on the ARM 
Cortex-A8, which provides a good example of a RISC-based superscalar design. 
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Figure 14.10 Architectural Block Diagram of ARM Cortex-A8 


The Cortex-A8 is in the ARM family of processors that ARM refers to as ap- 
plication processors. An ARM application processor is an embedded processor run- 
ning complex operating systems for wireless, consumer and imaging applications. 
The Cortex-A8 targets a wide variety of mobile and consumer applications includ- 
ing mobile phones, set-top boxes, gaming consoles and automotive navigation/ 
entertainment systems. 

Figure 14.10 shows a logical view of the Cortex-A8 architecture, emphasizing 
the flow of instructions among functional units. The main instruction flow is through 
three functional units that implement a dual, in-order-issue, 13-stage pipeline. The 
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(c) Instruction execute and load/store pipeline 


Figure 14.11 ARM Cortex-A8 Integer Pipeline 


Cortex designers decided to stay with in-order issue to keep additional power re- 
quired to a minimum. Out-of-order issue and retire can require extensive amounts 
of logic consuming extra power. 

Figure 14.11 shows the details of the main Cortex-A8 pipeline. There is a 
separate unit for SIMD (single-instruction-multiple-data) unit that implements a 
10-stage pipeline. 


Instruction Fetch Unit 


The instruction fetch unit predicts the instruction stream, fetches instructions from the 
L1 instruction cache, and places the fetched instructions into a buffer for consumption 
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by the decode pipeline. The instruction fetch unit also includes the L1 instruction 
cache. Because there can be several unresolved branches in the pipeline, instruction 
fetches are speculative, meaning there is no guarantee that they are executed. A 
branch or exceptional instruction in the code stream can cause a pipeline flush, dis- 
carding the currently fetched instructions. The instruction fetch unit can fetch up to 
four instructions per cycle, and goes through the following stages: 


FO The address generation unit (AGU) generates a new virtual address. Nor- 
mally, this address is the next address sequentially from the preceding fetch 
address. The address can also be a branch target address provided by a branch 
prediction for a previous instruction. FO is not counted as part of the 13-stage 
pipeline, because ARM processors have traditionally defined instruction 
cache access as the first stage. 


F1 The calculated address is used to fetch instructions from the L1 instruction 
cache. In parallel, the fetch address is used to access the branch prediction arrays 
to determine if the next fetch address should be based on a branch prediction. 


F3 Instruction data are placed into the instruction queue. If an instruction 
results in branch prediction, the new target address is sent to the address gen- 
eration unit. 


To minimize the branch penalties typically associated with a deeper pipeline, the 
Cortex-A8 processor implements a two-level global history branch predictor, consist- 
ing of the branch target buffer (BTB) and the global history buffer (GHB). These data 
structures are accessed in parallel with instruction fetches. The BTB indicates whether 
or not the current fetch address will return a branch instruction and its branch target 
address. It contains 512 entries. On a hit in the BTB a branch is predicted and the 
GHB is accessed. The GHB consists of 4096 2-bit counters that encode the strength 
and direction information of branches. The GHB is indexed by 10-bit history of the di- 
rection of the last ten branches encountered and 4 bits of the PC. In addition to the dy- 
namic branch predictor, a return stack is used to predict subroutine return addresses. 
The return stack has eight 32-bit entries that store the link register value in r14 and the 
ARM or Thumb state of the calling function. When a return-type instruction is pre- 
dicted taken, the return stack provides the last pushed address and state. 

The instruction fetch unit can fetch and queue up to 12 instructions. It issues 
instructions to the decode unit two at a time. The queue enables the instruction fetch 
unit to prefetch ahead of the rest of the integer pipeline and build up a backlog of 
instructions ready for decoding. 


Instruction Decode Unit 


The instruction decode unit decodes and sequences all ARM and Thumb instructions. 
It has a dual pipeline structure, called pipeO and pipel, so that two instructions can 
progress through the unit at a time. When two instructions are issued from the instruc- 
tion decode pipeline, pipeO will always contain the older instruction in program order. 
This means that if the instruction in pipe0 cannot issue, then the instruction in pipel 
will not issue. All issued instructions progress in order down the execution pipeline 
with results written back into the register file at the end of the execution pipeline. This 
in-order instruction issue and retire prevents WAR hazards and keeps tracking of 
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WAW hazards and recovery from flush conditions straightforward. Thus, the main con- 
cern of the instruction decode pipeline is the prevention of RAW hazards. 
Each instruction goes through five stages of processing. 


DO Thumb instructions are decompressed into 32-bit ARM instructions. A 
preliminary decode function is performed. 


D1 The instruction decode function is completed. 


D2 This stage writes instructions into and read instructions from the pending/ 
replay queue structure. 


D3 This stage contains the instruction scheduling logic. A scoreboard predicts 
register availability using static scheduling techniques.* Hazard checking is 
also done at this stage. 


D4 Performs the final decode for all the control signals required by the inte- 
ger execute and load/store units. 


In the first two stages, the instruction type, the source and destination operands, 
and resource requirements for the instruction are determined. A few less commonly 
used instructions are referred to as multicycle instructions. The D1 stage breaks these 
instructions down into multiple instruction opcodes that are sequenced individually 
through the execution pipeline. 

The pending queue serves two purposes. First, it prevents a stall signal from 
D3 from rippling any further up the pipeline. Second, by buffering instructions, 
there should always be two instructions available for the dual pipeline. In the case 
where only one instruction is issued, the pending queue enables two instructions to 
proceed down the pipeline together, even if they were originally sent from the fetch 
unit in different cycles. 

The replay operation is designed to deal with the effects of the memory system 
on instruction timing. Instructions are statically scheduled in the D3 stage based on 
a prediction of when the source operand will be available. Any stall from the mem- 
ory system can result in the minimum of an 8-cycle delay. This 8-cycle delay mini- 
mum is balanced with the minimum number of possible cycles to receive data from 
the L2 cache in the case of an L1 load miss. Table 14.2 gives the most common cases 
that can result in an instruction replay because of a memory system stall. 

To deal with these stalls, a recovery mechanism is used to flush all subsequent 
instructions in the execution pipeline and reissue (replay) them. To support replay, 
instructions are copied into the replay queue before they are issued and removed as 
they write back their results and retire. If a replay signal is issued instructions are re- 
trieved from the replay queue and reenter the pipeline. 

The decode unit issues two instructions in parallel to the execution unit, unless it 
encounters an issue restriction. Table 14.3 shows the most common restriction cases. 


Integer Execute Unit 


The instruction execute unit consists of two symmetric arithmetic logic unit (ALU) 
pipelines, an address generator for load and store instructions, and the multiply 


3See Appendix I for a discussion of scoreboarding. 
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Table 14.2 Cortex-A8 Memory System Effects on Instruction Timings 


Replay Event Delay Description 


Load data miss 8 cycles 1. A load instruction misses in the L1 data cache. 
2. A request is then made to the L2 data cache. 


3. If a miss also occurs in the L2 data cache, then a sec- 
ond replay occurs. The number of stall cycles depends 
on the external system memory timing. The minimum 
time required to receive the critical word for an L2 
cache miss is approximately 25 cycles, but can be 
much longer because of L3 memory latencies. 


Data TLB miss 24 cycles . A table walk because of a miss in the L1 TLB causes 
a 24-cycle delay, assuming the translation table 
entries are found in the L2 cache. 


. If the translation table entries are not present in the 
L2 cache, the number of stall cycles depends on the 
external system memory timing. 


Store buffer full 8 cycles plus latency . A store instruction miss does not result in any stalls 
to drain fill buffer unless the store buffer is full. 


. In the case of a full store buffer, the delay is at least 
eight cycles. The delay can be more if it takes longer 
to drain some entries from the store buffer. 


Unaligned load or 8 cycles . Ifa load instruction address is unaligned and the full 
store request access is not contained within a 128-bit boundary, 
there is a 8-cycle penalty. 


. If a store instruction address is unaligned and the full 
access is not contained within a 64-bit boundary, 
there is a 8-cycle penalty. 


pipeline. The execute pipelines also perform register write back. The instruction 
execute unit: 
e Executes all integer ALU and multiply operations, including flag generation 


e Generates the virtual addresses for loads and stores and the base write-back 
value, when required 


e Supplies formatted data for stores and forwards data and flags 
e Processes branches and other changes of instruction stream and evaluates in- 
struction condition codes 


For ALU instructions, either pipeline can be used, consisting of the following 
stages: 


EO Access register file. Up to six registers can be read from the register file for 
two instructions. 

E1 The barrel shifter (Figure 12.25) performs its function, if needed. 

E2 The ALU unit (Figure 12.25) performs its function. 


E3 If needed, this stage completes saturation arithmetic used by some ARM 
data processing instructions. 
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Table 14.3. Cortex-A8 Dual-Issue Restrictions 


Restriction 
Type 


Description 


Example 


Restriction 


Load/store 
resource 
hazard 


There is only one LS pipeline. 
Only one LS instruction can 
be issued per cycle. It can be in 
pipeline 0 or pipeline 1 


LDR 15, [r6] 
STR 17, [18] 
MOV 19, 110 


Wait for LS unit 
Dual issue possible 


Multiply 
resource 
hazard 


There is only one multiply 
pipeline, and it is only 
available in pipeline 0. 


ADD 11, 12, 13 
MUL £4, 15, r6 
MUL 17, £8, r9 


Wait for pipeline 0 
Wait for multiply unit 


Branch 
resource 
hazard 


There can be only one 
branch per cycle. It can be in 
pipeline 0 or pipeline 1.A 
branch is any instruction that 
changes the PC. 


BX rl 
BEQ 0x1000 
ADD 11, 12, 13 


Wait for branch Dual 
issue possible 


Data output 
hazard 


Instructions with the same 
destination cannot be issued in 
the same cycle. This can happen 
with conditional code. 


MOVEQ 11, 12 
MOVNE rl, 13 


LDR 15, [r6] 


Wait because of output 
dependency 
Dual issue possible 


Data 
source 
hazard 


Instructions cannot be issued 
if their data is not available. 
See the scheduling tables for 
source requirements and 
stages results. 


ADD 11, 12, 13 
ADD f4,r1, 16 
LDR 17, [r4] 


Wait for r1 
Wait two cycles for r4 


Multi-cycle 
instructions 


Multi-cycle instructions must 
issue in pipeline 0 and can 
only dual issue in their last 
iteration. 


MOV 11,12 
LDM 13, {r4-r7} 
LDM (cycle 2) 
LDM (cycle 3) 


ADD £8, 19, 110 


Wait for pipeline 0, transfer r4 

Transfer r5, r6 

Transfer r7 

Dual issue possible on last 
transfer 


E4 Any change in control flow, including branch misprediction, exceptions, 
and memory system replays are prioritized and processed. 


ES5 Results of ARM instructions are written back into the register file. 


Instructions that invoke the multiply unit (Figure 12.25) are routed to pipe0; the 
multiply operation is performed in stages E1 through E3, and the multiply accumulate 
operation in stage E4. 

The load/store pipeline runs parallel to the integer pipeline. The stages are as 
follows: 


E1 The memory address is generated from the base and index register. 


E2 The address is applied to the cache arrays. 


E3 In the case of a load, data are returned and formatted for forwarding to 
the ALU or MUL unit. In the case of a store, the data are formatted and ready 
to be written into the cache. 


E4 Performs updates to the L2 cache, if required. 


ES5 Results of ARM instructions are written back into the register file. 
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Cortex-A8 Example Dual Issue Instruction Sequence for Integer Pipeline 


Program Counter 


Instruction 


Timing Description 


0x00000ed0 


Bx rl4 


Dual issue pipeline 0 


0x00000ee4 


CMP 10,#0 


Dual issue in pipeline 1 


0x00000ee8 
0x00000eec 


MOV 13,#3 
MOV 10,#0 


Dual issue pipeline 0 


Dual issue in pipeline 1 


0x00000ef0 


STREQ 73,[r1,#0] 


Dual issue in pipeline 0, r3 not needed 
until E3 


0x00000ef4 


CMP 12,#4 


Dual issue in pipeline 1 


0x00000ef8 


LDRLS pc,[pe,r2,LSL #2] 


Single issue pipeline 0, +1 cycle for 
load to pc, no extra cycle for shift since 
LSL #2 


0x00000f2c 


MOV 10,#1 


Dual issue with 2nd iteration of load in 
pipeline 1 


0x00000f30 


B {pc}+8 


#0xf38 dual issue pipeline 0 


Ox00000f38 


STR 10,[r1,#0] 


Dual issue pipeline 1 


0x00000f3c: 


LDR pc,[r13],#4 


Single issue pipeline 0, +1 cycle for load 
to pe 


0x0000017c 


ADD 12,r4,#0xc 


Dual issue with 2nd iteration of load in 
pipeline 1 


0x00000180 


LDR 10,[r6,#4] 


Dual issue pipeline 0 


0x00000184 


MOV 11,#0xa 


Dual issue pipeline 1 


0x00000188 


LDR 10,[r0,#0] 


Single issue pipeline 0: r0 produced in 
E3, required in El, so +2 cycle stall 


0x0000018c 


STR 10,[r4,#0] 


Single issue pipeline 0 due to LS resource 
hazard, no extra delay for r0 since 
produced in E3 and consumed in E3 


0x00000190 


0x00000194 


LDR 10,[r4,#0xc] 


LDMFD 113!,{r4-r6,114} 


Single issue pipeline 0 due to LS resource 
hazard 


Load multiple loads r4 in Ist cycle, r5 and 
r6 in 2nd cycle, r14 in 3rd cycle, 3 cycles 
total 


0x00000198 


B {pc}+0xda8 


#0xf40 dual issue in pipeline 1 with 3rd 
cycle of LDM 


Ox00000f40 


ADD 10,r0,#2 ARM 


Single issue in pipeline 0 


Ox00000f44 


ADD 10,r1,r0 ARM 


Single issue in pipeline 0, no dual issue 
due to hazard on 10 produced in E2 and 
required in E2 


Table 14.4 shows a sample code segment and indicates how the processor 


might schedule it. 


All SIMD and floating-point instructions pass through the integer pipeline and 
are processed in a separate 10-stage pipeline (Figure 14.12). This unit, referred to 


552 CHAPTER 14 / INSTRUCTION-LEVEL PARALLELISM & SUPERSCALAR PROCESSORS 


NEON register writeback 


MUL MUL ACC ACC 
DUP > ; | 5 | : real - >| WB Integer 
: ALU, 
Instruction decode MAC 
Shift 1 P| Shift 2 P| Shift 3 >] nal p>) WB SHIFT 
16-entry Dec Scores REg pipes 
Inst P| tal P| read +>] 
queue board FMT > ALU | ABS ial P| WB 
queue rf a ap 
ae Rd/Wr Issue me 
ns : wding 
p>) check fe) logic FMUL|, | FMUL|, |FMUL |, | FMUL Non-IEEE 
Dec muxes > ee ee” ee)” (ee |e FMUL pipe 
FADD |_|FADD| |FADD| |FADD Non-IEEE 
"a 6 6» 6 « ES 
IEEE 
Load and store VFP | WB single/double 
with alignment precision VFP 
3 uM Ea Load/store 
Mux L1/ p>] °S2"Y | Load Pp] VU 4->)PERM | |PERM| | Store | |S cntty a 
store Ali with Pm >) Pl Ation PI store > | WB 
MCR Pp queue >) “289 >| NRE OP 1 Lei) queue permute 


Figure 14.12 ARM Cortex-A8 NEON and Floating-Point Pipeline 


as the NEON unit, handles packed SIMD instructions, and provides two types of 
floating-point support. If implemented, a vector floating-point (VFP) coprocessor 
performs floating-point operations in compliance with IEEE 754. If the coproces- 
sor is not present, then separate multiply and add pipelines implement the floating- 
point operations 


14.5 RECOMMENDED READING 


Two good book-length treatments of superscalar design are [SHENOS5] and [OMON99]. 
Worthwhile survey articles on the subject are [SMIT95] and [SIMA97]. [JOUP89a] examines 
instruction-level parallelism, looks at various techniques for maximizing parallelism, and 
compares superscalar and superpipelined approaches using simulation. Recent papers that 
provide good coverage of superscalar design issues include [SIMA04], [PATTO1], and 
[MOSH01]. 

[POPE91] provides a detailed look at a proposed superscalar machine. It also provides 
an excellent tutorial on the design issues related to out-of-order instruction policies. Another 
look at a proposed system is found in [KUGA91]; this article raises and considers most of the 
important design issues for superscalar implementation. [LEE91] examines software tech- 
niques that can be used to enhance superscalar performance. [WALL91] is an interesting 
study of the extent to which instruction-level parallelism can be exploited in a superscalar 
processor. 
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Volume I of [INTE04a] provides general description of the Pentium 4 pipeline; more 
detail is provided in [INTE01a] and [INTE01b]. Another detailed treatment is [FOGO8b]. 

[JOHNO8] and [ARMO8a] provide thorough coverage of the ARM Cortex-A8 
pipeline. [RICHO07] is a good overview. 
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14.6 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


antidependency machine parallelism register renaming 
branch prediction micro-operations resource conflict 
commit micro-ops retire 

flow dependency out-of-order superpipelined 


in-order issue completion superscalar 

in-order completion out-of-order issue true data dependency 
instruction issue output dependency write-read dependency 
instruction-level parallelism procedural dependency write-write 

instruction window read-write dependency dependency 


Review Questions 


14.1 What is the essential characteristic of the superscalar approach to processor design? 
14.2 What is the difference between the superscalar and superpipelined approaches? 
14.3. What is instruction-level parallelism? 


14.4 Briefly define the following terms: 
True data dependency 
Procedural dependency 
Resource conflicts 

Output dependency 
Antidependency 


14.5 What is the distinction between instruction-level parallelism and machine parallelism? 
14.6 List and briefly define three types of superscalar instruction issue policies. 

14.7 What is the purpose of an instruction window? 

14.8 What is register renaming and what is its purpose? 

14.9 What are the key elements of a superscalar processor organization? 


Problems 


14.1. When out-of-order completion is used in a superscalar processor, resumption of exe- 
cution after interrupt processing is complicated, because the exceptional condition 
may have been detected as an instruction that produced its result out of order. The 
program cannot be restarted at the instruction following the exceptional instruction, 
because subsequent instructions have already completed, and doing so would cause 
these instructions to be executed twice. Suggest a mechanism or mechanisms for deal- 
ing with this situation. 
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14.2 Consider the following sequence of instructions, where the syntax consists of an op- 


14.3 


code followed by the destination register followed by one or two source registers: 


) ADD R3, Rl, R2 
1 LOAD R6, [R3] 
2 AND R7, R5, 3 
3 ADD R1, R6, RO 
4 SRL R7, RO, 8 
5 OR R2, R4, R7 
6 SUB R5, R3, R4 
7 ADD RO, R1, R10 
8 LOAD R6, [R5] 
9 SUB R2, Rl, R6 
10 AND R3, R7, 15 


Assume the use of a four-stage pipeline: fetch, decode/issue, execute, write back. As- 
sume that all pipeline stages take one clock cycle except for the execute stage. For 
simple integer arithmetic and logical instructions, the execute stage takes one cycle, 
but for a LOAD from memory, five cycles are consumed in the execute stage. 

If we have a simple scalar pipeline but allow out-of-order execution, we can 
construct the following table for the execution of the first seven instructions: 


Instruction Fetch Decode Execute Write Back 

0 0 1 2 3 

1 1 2 

2 2 3 

3 3 4 10 11 

4 4 5 6 7 

5 5 6 8 10 

6 6 7: 9 12 


The entries under the four pipeline stages indicate the clock cycle at which each instruc- 

tion begins each phase. In this program, the second ADD instruction (instruction 3) de- 

pends on the LOAD instruction (instruction 1) for one of its operands, r6. Because the 

LOAD instruction takes five clock cycles, and the issue logic encounters the dependent 

ADD instruction after two clocks, the issue logic must delay the ADD instruction for 

three clock cycles. With an out-of-order capability, the processor can stall instruction 3 at 

clock cycle 4, and then move on to issue the following three independent instructions, 

which enter execution at clocks 6, 8, and 9. The LOAD finishes execution at clock 9, and 

so the dependent ADD can be launched into execution on clock 10. 

a. Complete the preceding table. 

b. Redo the table assuming no out-of-order capability. What is the savings using the 
capability? 

c. Redo the table assuming a superscalar implementation that can handle two in- 
structions at a time at each stage. 


Consider the following assembly language program: 


I1: Move R3, R7 /R3 < (R7)/ 

I2: Load R8, (R3) /R8 €< Memory (R3)/ 

I3: Add R3, R3, 4 /R3 << (R3) + 4/ 

T4: Load R9, (R3) /R9 € Memory (R3)/ 

I5+ BLE R8, R98, L3 /Branch if (R9) > (R8)/ 


This program includes WAW, RAW, and WAR dependencies. Show these. 
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14.4 


14.5 


14.6 


Decode Execute Write Cycle 
Il 10 1 
10 Il 2 
10 Il 3 
1 4 
14 13 Il 2 5 
15 13 6 
15 16 B 14 7 
8 
is 16 9 


Figure 14.13 
Sequence 


An In-Order Issue, In-Order-Completion Execution 


a. Identify the write-read, write-write, and read-write dependencies in the following 
instruction sequence: 


I1: R1 = 100 

I2: Rl = R2 + R4 
732 .R2-sS ards = 25 
I4: R4 = R1 + R3 
I5: R1 = R1 + 30 


b. Rename the registers from part (a) to prevent dependency problems. Identify ref- 
erences to initial register values using the subscript “a” to the register reference. 


Consider the “in-order-issue/in-order-completion” execution sequence shown in 

Figure 14.13. 

a. Identify the most likely reason why I2 could not enter the execute stage until the 
fourth cycle. Will “in-order issue/out-of-order completion” or “out-of-order issue/ 
out-of-order completion” fix this? If so, which? 

b. Identify the reason why I6 could not enter the write stage until the nineth cycle. 
Will “in-order issue/out-of-order completion” or “out-of-order issue/out-of-order 
completion” fix this? If so, which? 


Figure 14.14 shows an example of a superscalar processor organization. The processor 
can issue two instructions per cycle if there is no resource conflict and no data depen- 
dence problem. There are essentially two pipelines, with four processing stages (fetch, 
decode, execute, and store). Each pipeline has its own fetch decode and store unit. 
Four functional units (multiplier, adder, logic unit, and load unit) are available for use 
in the execute stage and are shared by the two pipelines on a dynamic basis. The two 
store units can be dynamically used by the two pipelines, depending on availability at 
a particular cycle. There is a lookahead window with its own fetch and decoding logic. 
This window is used for instruction lookahead for out-of-order instruction issue. 
Consider the following program to be executed on this processor: 


I1: Load R1, A /R1L <— Memory (A)/ 
I2: Add R2, R1 /R2 < (R2) + R(1)/ 
I3: Add R3, R4 /R3 < (R3) + R(4)/ 
T4: Mul R4, R5 /R4 & (R4) + R(5)/ 
I5: Comp R6 /R6 <&— (R6)/ 

I6: Mul R6, R7 /R3 < (R3) + R(4)/ 


a. What dependencies exist in the program? 

b. Show the pipeline activity for this program on the processor of Figure 14.14 using 
in-order issue with in-order completion policies and using a presentation similar 
to Figure 14.2. 
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Fetch | Decode | Execute stage | Store | 
sage | sue unite | | 
m1 | m2 | m3 

fl —| dl : Adder sl ——. 
| | al | a2 | 
[2 a2 | ! Logic ! 
! | el | 

£3 d3 | Load s2 +> 


Lookahead window 


e2 


Figure 14.14 A Dual-Pipeline Superscalar Processor 


c. Repeat for in-order issue with out-of-order completion. 
d. Repeat for out-of-order issue with out-of-order completion. 

14.7 Figure 14.15 is from a paper on superscalar design. Explain the three parts of the fig- 
ure, and define w, x, y, and z. 

14.8 Yeh’s dynamic branch prediction algorithm, used on the Pentium 4, is a two-level 
branch prediction algorithm. The first level is the history of the last n branches. The 
second level is the branch behavior of the last s occurrences of that unique pattern of 
the last n branches. For each conditional branch instruction in a program, there is an 
entry in a Branch History Table (BHT). Each entry consists of 1 bits corresponding to 
the last n executions of the branch instruction, with a 1 if the branch was taken and a 


From w ——— -—— Tox, y,z —— | -——- Tox 
(a) 
From w —— | [—— Toy 
—__7—~ wo: 
(b) 
——_i -—— Tox 


From w ——— 


(c) 
Figure 14.15 Figure for Problem 14.7 
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Figure 14.16 Figure for Problem 14.8 


0 if the branch was not. Each BHT entry indexes into a Pattern Table (PT) that has 2n 
entries, one for each possible pattern of n bits. Each PT entry consists of s bits that are 
used in branch prediction, as was described in Chapter 12 (e.g., Figure 12.19). When a 
conditional branch is encountered during instruction fetch and decode, the address of 
the instruction is used to retrieve the appropriate BHT entry, which shows the recent 
history of the instruction. Then, the BHT entry is used to retrieve the appropriate PT 
entry for branch prediction. After the branch is executed, the BHT entry is updated, 
and then the appropriate PT entry is updated. 

a. In testing the performance of this scheme, Yeh tried five different prediction 
schemes, illustrated in Figure 14.16. Identify which three of these schemes corre- 
spond to those shown in Figures 12.19 and 12.28. Describe the remaining two 
schemes. 

b. With this algorithm, the prediction is not based on just the recent history of this 
particular branch instruction. Rather, it is based on the recent history of all pat- 
terns of branches that match the n-bit pattern in the BHT entry for this instruc- 
tion. Suggest a rationale for such a strategy. 


. Qv Seb eR. — “ES wae 
PART FOUR 


The Control Unit 


P.1 ISSUES FOR PART FOUR 


In Part Three, we focused on machine instructions and the operations performed 
by the processor to execute each instruction. What was left out of this discussion is 
exactly how each individual operation is caused to happen. This is the job of the 
control unit. 

The control unit is that portion of the processor that actually causes things to 
happen. The control unit issues control signals external to the processor to cause 
data exchange with memory and I/O modules. The control unit also issues control 
signals internal to the processor to move data between registers, to cause the ALU 
to perform a specified function, and to regulate other internal operations. Input to 
the control unit consists of the instruction register, flags, and control signals from 
external sources (e.g., interrupt signals). 


ROAD MAP FOR PART FOUR 


Chapter 15 Control Unit Operation 


In Chapter 15, we turn to a discussion of how processor functions are per- 
formed or, more specifically, how the various elements of the processor are 
controlled to provide these functions, by means of the control unit. It is shown 
that each instruction cycle is made up of a set of micro-operations that gener- 
ate control signals. Execution is accomplished by the effect of these control 
signals, emanating from the control unit to the ALU, registers, and system 
interconnection structure. Finally, an approach to the implementation of the 
control unit, referred to as hardwired implementation, is presented. 


Chapter 16 Microprogrammed Control 


In Chapter 16, we see how the concept of micro-operation leads to an 
elegant and powerful approach to control unit implementation, known as 
microprogramming. In essence, a lower-level programming language is 
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developed. Each instruction in the machine language of the processor is 
translated into a sequence of lower-level control unit instructions. These 
lower-level instructions are referred to as microinstructions, and the 


process of translation is referred to as microprogramming. The chapter 
describes the layout of a control memory containing a microprogram for 
each machine instruction is described. The structure and function of the 
microprogrammed control unit can then be explained. 
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15.1 Micro-Operations 


The Fetch Cycle 

The Indirect Cycle 
The Interrupt Cycle 
The Execute Cycle 
The Instruction Cycle 
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Functional Requirements 
Control Signals 
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Internal Processor Organization 
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15.3 Hardwired Implementation 


Control Unit Inputs 
Control Unit Logic 


15.4 Recommended Reading 
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KEY POINTS 


® The execution of an instruction involves the execution of a sequence of 
substeps, generally called cycles. For example, an execution may consist of 
fetch, indirect, execute, and interrupt cycles. Each cycle is in turn made up 
of a sequence of more fundamental operations, called micro-operations. 
A single micro-operation generally involves a transfer between registers, 
a transfer between a register and an external bus, or a simple ALU 
operation. 


® The control unit of a processor performs two tasks: (1) It causes the proces- 
sor to step through a series of micro-operations in the proper sequence, 
based on the program being executed, and (2) it generates the control sig- 
nals that cause each micro-operation to be executed. 


® The control signals generated by the control unit cause the opening and 
closing of logic gates, resulting in the transfer of data to and from registers 
and the operation of the ALU. 


® One technique for implementing a control unit is referred to as hardwired 
implementation, in which the control unit is a combinatorial circuit. Its 
input logic signals, governed by the current machine instruction, are trans- 
ferred into a set of output control signals. 


In Chapter 10, we pointed out that a machine instruction set goes a long way toward 
defining the processor. If we know the machine instruction set, including an under- 
standing of the effect of each opcode and an understanding of the addressing modes, 
and if we know the set of user-visible registers, then we know the functions that the 
processor must perform. This is not the complete picture. We must know the external 
interfaces, usually through a bus, and how interrupts are handled. With this line of rea- 
soning, the following list of those things needed to specify the function of a processor 
emerges: 


1. Operations (opcodes) 

2. Addressing modes 

3. Registers 

4. I/O module interface 

5. Memory module interface 
6. Interrupts 


This list, though general, is rather complete. Items 1 through 3 are defined by the in- 
struction set. Items 4 and 5 are typically defined by specifying the system bus. Item 6 is 
defined partially by the system bus and partially by the type of support the processor 
offers to the operating system. 

This list of six items might be termed the functional requirements for a processor. 
They determine what a processor must do. This is what occupied us in Parts Two and 
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Three. Now, we turn to the question of how these functions are performed or, more 
specifically, how the various elements of the processor are controlled to provide these 
functions. Thus, we turn to a discussion of the control unit, which controls the operation 
of the processor. 


15.1 MICRO-OPERATIONS 


We have seen that the operation of a computer, in executing a program, consists of a 
sequence of instruction cycles, with one machine instruction per cycle. Of course, we 
must remember that this sequence of instruction cycles is not necessarily the same 
as the written sequence of instructions that make up the program, because of the ex- 
istence of branching instructions. What we are referring to here is the execution time 
sequence of instructions. 

We have further seen that each instruction cycle is made up of a number of 
smaller units. One subdivision that we found convenient is fetch, indirect, execute, 
and interrupt, with only fetch and execute cycles always occurring. 

To design a control unit, however, we need to break down the description fur- 
ther. In our discussion of pipelining in Chapter 12, we began to see that a further de- 
composition is possible. In fact, we will see that each of the smaller cycles involves 
a series of steps, each of which involves the processor registers. We will refer to these 
steps as micro-operations. The prefix micro refers to the fact that each step is very 
simple and accomplishes very little. Figure 15.1 depicts the relationship among the 
various concepts we have been discussing. To summarize, the execution of a pro- 
gram consists of the sequential execution of instructions. Each instruction is exe- 
cuted during an instruction cycle made up of shorter subcycles (e.g., fetch, indirect, 
execute, interrupt). The execution of each subcycle involves one or more shorter op- 
erations, that is, micro-operations. 

Micro-operations are the functional, or atomic, operations of a processor. In 
this section, we will examine micro-operations to gain an understanding of how 


Program execution 


ne a ee ee 


Instruction cycle Instruction cycle eee Instruction cycle 
Fetch Indirect Execute Interrupt 
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Figure 15.1 Constituent Elements of a Program Execution 
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the events of any instruction cycle can be described as a sequence of such micro- 
operations. A simple example will be used. In the remainder of this chapter, we 
then show how the concept of micro-operations serves as a guide to the design of 
the control unit. 


The Fetch Cycle 


We begin by looking at the fetch cycle, which occurs at the beginning of each in- 
struction cycle and causes an instruction to be fetched from memory. For purposes 
of discussion, we assume the organization depicted in Figure 12.6. Four registers are 
involved: 


e Memory address register (MAR): Is connected to the address lines of the sys- 
tem bus. It specifies the address in memory for a read or write operation. 


e Memory buffer register (MBR): Is connected to the data lines of the system bus. 
It contains the value to be stored in memory or the last value read from memory. 


e Program counter (PC): Holds the address of the next instruction to be fetched. 
e Instruction register (IR): Holds the last instruction fetched. 


Let us look at the sequence of events for the fetch cycle from the point of 
view of its effect on the processor registers. An example appears in Figure 15.2. At 
the beginning of the fetch cycle, the address of the next instruction to be executed 
is in the program counter (PC); in this case, the address is 1100100. The first step is 
to move that address to the memory address register (MAR) because this is the 
only register connected to the address lines of the system bus. The second step is to 
bring in the instruction. The desired address (in the MAR) is placed on the address 


MAR MAR | 0000000001100100 
MBR MBR 
PC} 0000000001100100 PC} 0000000001100100 
IR IR 
AC AC 
(a) Beginning (before t,) (b) After first step 
MAR | 0000000001100100 MAR | 0000000001100100 
MBR | 0001000000100000 MBR |} 0001000000100000 
PC} 0000000001100101 PC} 0000000001100101 
IR IR| 0001000000100000 
AC AC 
(c) After second step (d) After third step 


Figure 15.2 Sequence of Events, Fetch Cycle 
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bus, the control unit issues a READ command on the control bus, and the result 
appears on the data bus and is copied into the memory buffer register (MBR). We 
also need to increment the PC by the instruction length to get ready for the next in- 
struction. Because these two actions (read word from memory, increment PC) do 
not interfere with each other, we can do them simultaneously to save time. The 
third step is to move the contents of the MBR to the instruction register (IR). This 
frees up the MBR for use during a possible indirect cycle. 

Thus, the simple fetch cycle actually consists of three steps and four micro- 
operations. Each micro-operation involves the movement of data into or out of a 
register. So long as these movements do not interfere with one another, several of 
them can take place during one step, saving time. Symbolically, we can write this 
sequence of events as follows: 


t,: MAR < (PC) 
t2: MBR <— Memory 
PE < (PEC) + 2 
t3: IR < (MBR) 


where J is the instruction length. We need to make several comments about this se- 
quence. We assume that a clock is available for timing purposes and that it emits regu- 
larly spaced clock pulses. Each clock pulse defines a time unit. Thus, all time units are of 
equal duration. Each micro-operation can be performed within the time of a single 
time unit. The notation (ty, t:, t3) represents successive time units. In words, we have 


e First time unit: Move contents of PC to MAR. 

e Second time unit: Move contents of memory location specified by MAR to 
MBR. Increment by J the contents of the PC. 

e Third time unit: Move contents of MBR to IR. 


Note that the second and third micro-operations both take place during the second 
time unit. The third micro-operation could have been grouped with the fourth with- 
out affecting the fetch operation: 


t,: MAR < (PC) 
t,: MBR < Memory 
t3: PC — (PC) + I 
IR < (MBR) 


The groupings of micro-operations must follow two simple rules: 


1. The proper sequence of events must be followed. Thus (MAR < (PC)) must 
precede (MBR < Memory) because the memory read operation makes use of 
the address in the MAR. 


. Conflicts must be avoided. One should not attempt to read to and write from 
the same register in one time unit, because the results would be unpredictable. 
For example, the micro-operations (MBR <— Memory) and (IR <- MBR) should 
not occur during the same time unit. 


nN 
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A final point worth noting is that one of the micro-operations involves an 
addition. To avoid duplication of circuitry, this addition could be performed by the 
ALU. The use of the ALU may involve additional micro-operations, depending on 
the functionality of the ALU and the organization of the processor. We defer a dis- 
cussion of this point until later in this chapter. 

It is useful to compare events described in this and the following subsections to 
Figure 3.5. Whereas micro-operations are ignored in that figure, this discussion shows 
the micro-operations needed to perform the subcycles of the instruction cycle. 


The Indirect Cycle 


Once an instruction is fetched, the next step is to fetch source operands. Continuing 
our simple example, let us assume a one-address instruction format, with direct and 
indirect addressing allowed. If the instruction specifies an indirect address, then an 
indirect cycle must precede the execute cycle. The data flow differs somewhat from 
that indicated in Figure 12.7 and includes the following micro-operations: 


t,: MAR < (IR(Address) ) 
t2: MBR < Memory 
t3: IR(Address) < (MBR(Address) ) 


The address field of the instruction is transferred to the MAR. This is then used 
to fetch the address of the operand. Finally, the address field of the IR is updated 
from the MBR, so that it now contains a direct rather than an indirect address. 

The IR is now in the same state as if indirect addressing had not been used, 
and it is ready for the execute cycle. We skip that cycle for a moment, to consider the 
interrupt cycle. 


The Interrupt Cycle 


At the completion of the execute cycle, a test is made to determine whether any en- 
abled interrupts have occurred. If so, the interrupt cycle occurs. The nature of this 
cycle varies greatly from one machine to another. We present a very simple sequence 
of events, as illustrated in Figure 12.8. We have 


t;: MBR < (PC) 

t.: MAR ¢< Save_Address 
PC ¢ Routine_Address 

t3: Memory < (MBR) 


In the first step, the contents of the PC are transferred to the MBR, so that they 
can be saved for return from the interrupt. Then the MAR is loaded with the address 
at which the contents of the PC are to be saved, and the PC is loaded with the address 
of the start of the interrupt-processing routine. These two actions may each be a 
single micro-operation. However, because most processors provide multiple types 
and/or levels of interrupts, it may take one or more additional micro-operations to 
obtain the Save_Address and the Routine_Address before they can be transferred 
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to the MAR and PC, respectively. In any case, once this is done, the final step is to 
store the MBR, which contains the old value of the PC, into memory. The processor 
is now ready to begin the next instruction cycle. 


The Execute Cycle 


The fetch, indirect, and interrupt cycles are simple and predictable. Each involves a 
small, fixed sequence of micro-operations and, in each case, the same micro-operations 
are repeated each time around. 

This is not true of the execute cycle. Because of the variety opcodes, there are 
a number of different sequences of micro-operations that can occur. Let us consider 
several hypothetical examples. 

First, consider an add instruction: 


ADD R1, X 


which adds the contents of the location X to register R1. The following sequence of 
micro-operations might occur: 


t,: MAR <¢ (IR(address) ) 
tj: MBR <— Memory 
t3: Rl < (R1) + (MBR) 


We begin with the IR containing the ADD instruction. In the first step, the ad- 
dress portion of the IR is loaded into the MAR. Then the referenced memory loca- 
tion is read. Finally, the contents of R1 and MBR are added by the ALU. Again, this 
is a simplified example. Additional micro-operations may be required to extract the 
register reference from the IR and perhaps to stage the ALU inputs or outputs in 
some intermediate registers. 

Let us look at two more complex examples. A common instruction is incre- 
ment and skip if zero: 


ISZ X 


The content of location X is incremented by 1. If the result is 0, the next instruction 
is skipped. A possible sequence of micro-operations is 


t,: MAR < (IR(address) ) 
t2: MBR < Memory 

t3: MBR <— (MBR) + 1 

t,: Memory < (MBR) 

f ((MBR) = 0) then (PC < (PC) + T) 


The new feature introduced here is the conditional action. The PC is incremented 
if (MBR) = 0. This test and action can be implemented as one micro-operation. Note 
also that this micro-operation can be performed during the same time unit during 
which the updated value in MBR is stored back to memory. 
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Finally, consider a subroutine call instruction. As an example, consider a branch- 
and-save-address instruction: 


BSA X 
The address of the instruction that follows the BSA instruction is saved in location 
X, and execution continues at location X + I. The saved address will later be used 
for return. This is a straightforward technique for providing subroutine calls. The fol- 


lowing micro-operations suffice: 


t,: MAR < (IR 


address) ) 


MBR < (PC 
ty: PC < (IR(address) ) 
Memory < (MBR) 


54 PO aS CRC ak 


The address in the PC at the start of the instruction is the address of the next 
instruction in sequence. This is saved at the address designated in the IR. The latter 
address is also incremented to provide the address of the instruction for the next in- 
struction cycle. 


The Instruction Cycle 


We have seen that each phase of the instruction cycle can be decomposed into a se- 
quence of elementary micro-operations. In our example, there is one sequence each 
for the fetch, indirect, and interrupt cycles, and, for the execute cycle, there is one se- 
quence of micro-operations for each opcode. 

To complete the picture, we need to tie sequences of micro-operations to- 
gether, and this is done in Figure 15.3. We assume a new 2-bit register called the 
instruction cycle code (ICC). The ICC designates the state of the processor in terms 
of which portion of the cycle it is in: 


00: Fetch 

01: Indirect 
10: Execute 
11: Interrupt 


At the end of each of the four cycles, the ICC is set appropriately. The indirect 
cycle is always followed by the execute cycle. The interrupt cycle is always followed 
by the fetch cycle (see Figure 12.4). For both the fetch and execute cycles, the next 
cycle depends on the state of the system. 

Thus, the flowchart of Figure 15.3 defines the complete sequence of micro- 
operations, depending only on the instruction sequence and the interrupt pattern. 
Of course, this is a simplified example. The flowchart for an actual processor would 
be more complex. In any case, we have reached the point in our discussion in which 
the operation of the processor is defined as the performance of a sequence of micro- 
operations. We can now consider how the control unit causes this sequence to occur. 
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Figure 15.3. Flowchart for Instruction Cycle 
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Functional Requirements 


As a result of our analysis in the preceding section, we have decomposed the behavior 
or functioning of the processor into elementary operations, called micro-operations. 
By reducing the operation of the processor to its most fundamental level, we are 
able to define exactly what it is that the control unit must cause to happen. Thus, we 
can define the functional requirements for the control unit: those functions that 
the control unit must perform. A definition of these functional requirements is the 
basis for the design and implementation of the control unit. 

With the information at hand, the following three-step process leads to a char- 
acterization of the control unit: 


1. Define the basic elements of the processor. 
2. Describe the micro-operations that the processor performs. 
3. Determine the functions that the control unit must perform to cause the 


micro-operations to be performed. 


We have already performed steps 1 and 2. Let us summarize the results. First, 
the basic functional elements of the processor are the following: 


e ALU 
e Registers 
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e Internal data paths 
e External data paths 
e Control unit 


Some thought should convince you that this is a complete list. The ALU is the 
functional essence of the computer. Registers are used to store data internal to the 
processor. Some registers contain status information needed to manage instruction 
sequencing (e.g., a program status word). Others contain data that go to or come 
from the ALU, memory, and I/O modules. Internal data paths are used to move data 
between registers and between register and ALU. External data paths link registers 
to memory and I/O modules, often by means of a system bus. The control unit causes 
operations to happen within the processor. 

The execution of a program consists of operations involving these processor ele- 
ments. As we have seen, these operations consist of a sequence of micro-operations. 
Upon review of Section 15.1, the reader should see that all micro-operations fall into 
one of the following categories: 


e Transfer data from one register to another. 
e Transfer data from a register to an external interface (e.g., system bus). 
e Transfer data from an external interface to a register. 


e Perform an arithmetic or logic operation, using registers for input and output. 


All of the micro-operations needed to perform one instruction cycle, including all of 
the micro-operations to execute every instruction in the instruction set, fall into one 
of these categories. 

We can now be somewhat more explicit about the way in which the control 
unit functions. The control unit performs two basic tasks: 


e Sequencing: The control unit causes the processor to step through a series 
of micro-operations in the proper sequence, based on the program being 
executed. 


e Execution: The control unit causes each micro-operation to be performed. 


The preceding is a functional description of what the control unit does. The 
key to how the control unit operates is the use of control signals. 


Control Signals 


We have defined the elements that make up the processor (ALU, registers, data 
paths) and the micro-operations that are performed. For the control unit to perform 
its function, it must have inputs that allow it to determine the state of the system and 
outputs that allow it to control the behavior of the system. These are the external 
specifications of the control unit. Internally, the control unit must have the logic re- 
quired to perform its sequencing and execution functions. We defer a discussion of 
the internal operation of the control unit to Section 15.3 and Chapter 16. The re- 
mainder of this section is concerned with the interaction between the control unit 
and the other elements of the processor. 
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Figure 15.4 Block Diagram of the Control Unit 


Figure 15.4 is a general model of the control unit, showing all of its inputs and 
outputs. The inputs are 


¢ Clock: This is how the control unit “keeps time.” The control unit causes one 
micro-operation (or a set of simultaneous micro-operations) to be performed 
for each clock pulse. This is sometimes referred to as the processor cycle time, 
or the clock cycle time. 


e Instruction register: The opcode and addressing mode of the current instruc- 
tion are used to determine which micro-operations to perform during the exe- 
cute cycle. 

e Flags: These are needed by the control unit to determine the status of the 
processor and the outcome of previous ALU operations. For example, for the 
increment-and-skip-if-zero (ISZ) instruction, the control unit will increment 
the PC if the zero flag is set. 


¢ Control signals from control bus: The control bus portion of the system bus 
provides signals to the control unit. 


The outputs are as follows: 


e Control signals within the processor: These are two types: those that cause 
data to be moved from one register to another, and those that activate specific 
ALU functions. 


e Control signals to control bus: These are also of two types: control signals to 
memory, and control signals to the I/O modules. 


Three types of control signals are used: those that activate an ALU function, 
those that activate a data path, and those that are signals on the external system bus 
or other external interface. All of these signals are ultimately applied directly as 
binary inputs to individual logic gates. 

Let us consider again the fetch cycle to see how the control unit maintains con- 
trol. The control unit keeps track of where it is in the instruction cycle. At a given 
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point, it knows that the fetch cycle is to be performed next. The first step is to transfer 
the contents of the PC to the MAR. The control unit does this by activating the control 
signal that opens the gates between the bits of the PC and the bits of the MAR. The 
next step is to read a word from memory into the MBR and increment the PC. The con- 
trol unit does this by sending the following control signals simultaneously: 


e A control signal that opens gates, allowing the contents of the MAR onto the 
address bus 


e A memory read control signal on the control bus 


e A control signal that opens the gates, allowing the contents of the data bus to 
be stored in the MBR 


e Control signals to logic that add 1 to the contents of the PC and store the re- 
sult back to the PC 


Following this, the control unit sends a control signal that opens gates between the 
MBR and the IR. 

This completes the fetch cycle except for one thing: The control unit must de- 
cide whether to perform an indirect cycle or an execute cycle next. To decide this, it 
examines the IR to see if an indirect memory reference is made. 

The indirect and interrupt cycles work similarly. For the execute cycle, the con- 
trol unit begins by examining the opcode and, on the basis of that, decides which se- 
quence of micro-operations to perform for the execute cycle. 


A Control Signals Example 


To illustrate the functioning of the control unit, let us examine a simple example. 
Figure 15.5 illustrates the example. This is a simple processor with a single accumulator 
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Figure 15.5 Data Paths and Control Signals 
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(AC). The data paths between elements are indicated. The control paths for signals 
emanating from the control unit are not shown, but the terminations of control sig- 
nals are labeled C; and indicated by a circle. The control unit receives inputs from 
the clock, the instruction register, and flags. With each clock cycle, the control unit 
reads all of its inputs and emits a set of control signals. Control signals go to three 
separate destinations: 


e Data paths: The control unit controls the internal flow of data. For example, 
on instruction fetch, the contents of the memory buffer register are trans- 
ferred to the instruction register. For each path to be controlled, there is a 
switch (indicated by a circle in the figure). A control signal from the control 
unit temporarily opens the gate to let data pass. 


e ALU: The control unit controls the operation of the ALU by a set of con- 
trol signals. These signals activate various logic circuits and gates within 
the ALU. 


e System bus: The control unit sends control signals out onto the control lines of 
the system bus (e.g., memory READ). 


The control unit must maintain knowledge of where it is in the instruction 
cycle. Using this knowledge, and by reading all of its inputs, the control unit emits a 
sequence of control signals that causes micro-operations to occur. It uses the clock 
pulses to time the sequence of events, allowing time between events for signal levels 
to stabilize. Table 15.1 indicates the control signals that are needed for some of the 
micro-operation sequences described earlier. For simplicity, the data and control 
paths for incrementing the PC and for loading the fixed addresses into the PC and 
MAR are not shown. 


Table 15.1 Micro-operations and Control Signals 


Active Control 
Micro-operations Signals 


ty; MAR < (PC) Cy 
to: MBR — Memory Cs, Cr 
PC< (PC) + 1 

:IR — (MBR) Cy 
: MAR < (IR(Address)) Cg 
Indirect: :MBR< Memory 
: IR(Address) — (MBR(Address)) Cy 
: MBR< (PC) Cc, 
Interrupt: : MAR < Save-address 


PC <— Routine-address 
t3: Memory <— (MBR) 


Cp = Read control signal to system bus. 
Cw = Write control signal to system bus. 
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It is worth pondering the minimal nature of the control unit. The control 
unit is the engine that runs the entire computer. It does this based only on know- 
ing the instructions to be executed and the nature of the results of arithmetic and 
logical operations (e.g., positive, overflow, etc.). It never gets to see the data being 
processed or the actual results produced. And it controls everything with a few 
control signals to points within the processor and a few control signals to the sys- 
tem bus. 


Internal Processor Organization 


Figure 15.5 indicates the use of a variety of data paths. The complexity of this type of 
organization should be clear. More typically, some sort of internal bus arrangement, 
as was suggested in Figure 12.2, will be used. 

Using an internal processor bus, Figure 15.5 can be rearranged as shown in 
Figure 15.6. A single internal bus connects the ALU and all processor registers. 
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Figure 15.6 CPU with Internal Bus 
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Gates and control signals are provided for movement of data onto and off the bus 
from each register. Additional control signals control data transfer to and from the 
system (external) bus and the operation of the ALU. 

Two new registers, labeled Y and Z, have been added to the organization. 
These are needed for the proper operation of the ALU. When an operation in- 
volving two operands is performed, one can be obtained from the internal bus, 
but the other must be obtained from another source. The AC could be used for 
this purpose, but this limits the flexibility of the system and would not work with 
a processor with multiple general-purpose registers. Register Y provides temporary 
storage for the other input. The ALU is a combinatorial circuit (see Chapter 20) 
with no internal storage. Thus, when control signals activate an ALU function, 
the input to the ALU is transformed to the output. Thus, the output of the ALU 
cannot be directly connected to the bus, because this output would feed back 
to the input. Register Z provides temporary output storage. With this arrange- 
ment, an operation to add a value from memory to the AC would have the fol- 
lowing steps: 


t,: MAR <¢ (IR(address) ) 
t,: MBR < Memory 

tz: Y < (MBR) 

tg: Z< (AC) + (Y) 

ts: AC — (2) 


Other organizations are possible, but, in general, some sort of internal bus or 
set of internal buses is used. The use of common data paths simplifies the intercon- 
nection layout and the control of the processor. Another practical reason for the use 
of an internal bus is to save space. 


The Intel 8085 


To illustrate some of the concepts introduced thus far in this chapter, let us consider 
the Intel 8085. Its organization is shown in Figure 15.7. Several key components that 
may not be self-explanatory are: 


e Incrementer/decrementer address latch: Logic that can add 1 to or subtract 1 
from the contents of the stack pointer or program counter. This saves time by 
avoiding the use of the ALU for this purpose. 


e Interrupt control: This module handles multiple levels of interrupt signals. 


e Serial I/O control: This module interfaces to devices that communicate 1 bit at 
a time. 


Table 15.2 describes the external signals into and out of the 8085. These are 
linked to the external system bus. These signals are the interface between the 8085 
processor and the rest of the system (Figure 15.8). 
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Figure 15.7 Intel 8085 CPU Block Diagram 


Table 15.2. Intel 8085 External Signals 
Address and Data Signals 


High Address (A15—A8) 
The high-order 8 bits of a 16-bit address. 


Address/Data (AD7-AD0) 
The lower-order 8 bits of a 16-bit address or 8 bits of data. This multiplexing saves on pins. 


Serial Input Data (SID) 
A single-bit input to accommodate devices that transmit serially (one bit at a time). 


Serial Output Data (SOD) 
A single-bit output to accommodate devices that receive serially. 


Timing and Control Signals 


CLK (OUT) 
The system clock. The CLK signal goes to peripheral chips and synchronizes their timing. 


X1, X2 
These signals come from an external crystal or other device to drive the internal clock generator. 


Address Latch Enabled (ALE) 
Occurs during the first clock state of a machine cycle and causes peripheral chips to store the address lines. 
This allows the address module (e.g., memory, I/O) to recognize that it is being addressed. 


Status (SO, S1) 
Control signals used to indicate whether a read or write operation is taking place. 


I0/M 
Used to enable either I/O or memory modules for read and write operations. 


Read Control (RD) 
Indicates that the selected memory or I/O module is to be read and that the data bus is available for data 
transfer. 


Write Control (WR) 
Indicates that data on the data bus is to be written into the selected memory or I/O location. 


Memory and I/O Initiated Symbols 


Hold 

Requests the CPU to relinquish control and use of the external system bus. The CPU will complete 
execution of the instruction presently in the IR and then enter a hold state, during which no signals are 
inserted by the CPU to the control, address, or data buses. During the hold state, the bus may be used for 
DMA operations. 


Hold Acknowledge (HOLDA) 
This control unit output signal acknowledges the HOLD signal and indicates that the bus is now 
available. 


READY 

Used to synchronize the CPU with slower memory or I/O devices. When an addressed device asserts 
READY, the CPU may proceed with an input (DBIN) or output (WR) operation. Otherwise, the CPU enters 
a wait state until the device is ready. 


(Continued) 
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Table 15.2 Continued 
Interrupt-Related Signals 


TRAP 
Restart Interrupts (RST 7.5, 6.5, 5.5) 


Interrupt Request (INTR) 

These five lines are used by an external device to interrupt the CPU. The CPU will not honor the request if 
it is in the hold state or if the interrupt is disabled. An interrupt is honored only at the completion of an in- 
struction. The interrupts are in descending order of priority. 


Interrupt Acknowledge 
Acknowledges an interrupt. 


CPU Initialization 


RESET IN 
Causes the contents of the PC to be set to zero. The CPU resumes execution at location zero. 


RESET OUT 
Acknowledges that the CPU has been reset. The signal can be used to reset the rest of the system. 


Voltage and Ground 


VCC 
+5-volt power supply 


VSS 
Electrical ground 


The control unit is identified as having two components labeled (1) instruction 
decoder and machine cycle encoding and (2) timing and control. A discussion of the 
first component is deferred until the next section. The essence of the control unit is 
the timing and control module. This module includes a clock and accepts as inputs the 
current instruction and some external control signals. Its output consists of control 
signals to the other components of the processor plus control signals to the external 
system bus. 

The timing of processor operations is synchronized by the clock and con- 
trolled by the control unit with control signals. Each instruction cycle is divided 
into from one to five machine cycles; each machine cycle is in turn divided into 
from three to five states. Each state lasts one clock cycle. During a state, the proces- 
sor performs one or a set of simultaneous micro-operations as determined by the 
control signals. 

The number of machine cycles is fixed for a given instruction but varies from 
one instruction to another. Machine cycles are defined to be equivalent to bus ac- 
cesses. Thus, the number of machine cycles for an instruction depends on the num- 
ber of times the processor must communicate with external devices. For example, if 
an instruction consists of two 8-bit portions, then two machine cycles are required to 
fetch the instruction. If that instruction involves a 1-byte memory or I/O operation, 
then a third machine cycle is required for execution. 
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Figure 15.8 Intel 8085 Pin Configuration 


Figure 15.9 gives an example of 8085 timing, showing the value of external 
control signals. Of course, at the same time, the control unit generates internal 
control signals that control internal data transfers. The diagram shows the instruc- 
tion cycle for an OUT instruction. Three machine cycles (M,, Mo, M3) are needed. 
During the first, the OUT instruction is fetched. The second machine cycle fetches 
the second half of the instruction, which contains the number of the I/O device se- 
lected for output. During the third cycle, the contents of the AC are written out to 
the selected device over the data bus. 

The Address Latch Enabled (ALE) pulse signals the start of each machine cycle 
from the control unit. The ALE pulse alerts external circuits. During timing state T, of 
machine cycle M,, the control unit sets the IO/M signal to indicate that this is a mem- 
ory operation. Also, the control unit causes the contents of the PC to be placed on the 
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address bus (Aj5 through Ag) and the address/data bus (AD, through AD ). With the 
falling edge of the ALE pulse, the other modules on the bus store the address. 
During timing state Tz, the addressed memory module places the contents of the 
addressed memory location on the address/data bus. The control unit sets the Read 
Control (RD) signal to indicate a read, but it waits until T; to copy the data from the 
bus. This gives the memory module time to put the data on the bus and for the signal 
levels to stabilize. The final state, Ty, is a bus idle state during which the processor de- 
codes the instruction. The remaining machine cycles proceed in a similar fashion. 


15.3 HARDWIRED IMPLEMENTATION 


We have discussed the control unit in terms of its inputs, output, and functions. We 
now turn to the topic of control unit implementation. A wide variety of techniques 
have been used. Most of these fall into one of two categories: 


e Hardwired implementation 
e Microprogrammed implementation 


In a hardwired implementation, the control unit is essentially a state machine 
circuit. Its input logic signals are transformed into a set of output logic signals, which 
are the control signals. This approach is examined in this section. Microprogrammed 
implementation is the subject of Chapter 16. 


Control Unit Inputs 


Figure 15.4 depicts the control unit as we have so far discussed it. The key inputs are 
the instruction register, the clock, flags, and control bus signals. In the case of the flags 
and control bus signals, each individual bit typically has some meaning (e.g., over- 
flow). The other two inputs, however, are not directly useful to the control unit. 

First consider the instruction register. The control unit makes use of the op- 
code and will perform different actions (issue a different combination of control sig- 
nals) for different instructions. To simplify the control unit logic, there should be a 
unique logic input for each opcode. This function can be performed by a decoder, 
which takes an encoded input and produces a single output. In general, a decoder 
will have n binary inputs and 2” binary outputs. Each of the 2” different input pat- 
terns will activate a single unique output. Table 15.3 is an example for n = 4. The de- 
coder for a control unit will typically have to be more complex than that, to account 
for variable-length opcodes. An example of the digital logic used to implement a de- 
coder is presented in Chapter 20. 

The clock portion of the control unit issues a repetitive sequence of pulses. 
This is useful for measuring the duration of micro-operations. Essentially, the period 
of the clock pulses must be long enough to allow the propagation of signals along 
data paths and through processor circuitry. However, as we have seen, the control 
unit emits different control signals at different time units within a single instruction 
cycle. Thus, we would like a counter as input to the control unit, with a different con- 
trol signal being used for T,, T>, and so forth. At the end of an instruction cycle, the 
control unit must feed back to the counter to reinitialize it at T,. 


A Decoder with Four Inputs and Sixteen Outputs 
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Figure 15.10 Control Unit with Decoded Inputs 


With these two refinements, the control unit can be depicted as in Figure 15.10. 


Control Unit Logic 


To define the hardwired implementation of a control unit, all that remains is to dis- 
cuss the internal logic of the control unit that produces output control signals as a 
function of its input signals. 

Essentially, what must be done is, for each control signal, to derive a Boolean 
expression of that signal as a function of the inputs. This is best explained by exam- 
ple. Let us consider again our simple example illustrated in Figure 15.5. We saw in 
Table 15.1 the micro-operation sequences and control signals needed to control 
three of the four phases of the instruction cycle. 

Let us consider a single control signal, Cs. This signal causes data to be read 
from the external data bus into the MBR. We can see that it is used twice in Table 
15.1. Let us define two new control signals, P and Q, that have the following 
interpretation: 


PQ = 00 Fetch Cycle 
PQ = 01 Indirect Cycle 
PQ = 10 Execute Cycle 
PQ = 11 Interrupt Cycle 
Then the following Boolean expression defines Cs: 
Cs; = P:Q:T, + P-Q°-T; 


That is, the control signal C; will be asserted during the second time unit of both the 
fetch and indirect cycles. 
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This expression is not complete. C; is also needed during the execute cycle. For 
our simple example, let us assume that there are only three instructions that read 
from memory: LDA, ADD, and AND. Now we can define Cs as 


C; = P-O-T, + P-Q:T, + P-O-(LDA + ADD + AND)-T; 


This same process could be repeated for every control signal generated by the 
processor. The result would be a set of Boolean equations that define the behavior 
of the control unit and hence of the processor. 

To tie everything together, the control unit must control the state of the in- 
struction cycle. As was mentioned, at the end of each subcycle (fetch, indirect, exe- 
cute, interrupt), the control unit issues a signal that causes the timing generator to 
reinitialize and issue T;. The control unit must also set the appropriate values of 
P and Q to define the next subcycle to be performed. 

The reader should be able to appreciate that in a modern complex processor, 
the number of Boolean equations needed to define the control unit is very large. 
The task of implementing a combinatorial circuit that satisfies all of these equations 
becomes extremely difficult. The result is that a far simpler approach, known as 
microprogramming, is usually used. This is the subject of the next chapter. 


15.4 RECOMMENDED READING 


A number of textbooks treat the basic principles of control unit function; two particularly 
clear treatments are in [FARH04] and [MANO04]. 


FARH04 = Farhat, H. Digital Design and Computer Organization. Boca Raton, FL: CRC 
Press, 2004. 


MANO04 Mano, M. Logic and Computer Design Fundamentals. Upper Saddle River, 
NJ: Prentice Hall, 2004. 


15.5 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


control bus control signal hardwired implementation 
control path control unit microoperations 


Review Questions 


15.1 Explain the distinction between the written sequence and the time sequence of an 
instruction. 


15.2 What is the relationship between instructions and micro-operations? 
15.3. What is the overall function of a processor’s control unit? 
15.4 Outline a three-step process that leads to a characterization of the control unit. 
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What basic tasks does a control unit perform? 

Provide a typical list of the inputs and outputs of a control unit. 

List three types of control signals. 

Briefly explain what is meant by a hardwired implementation of a control unit. 


Problems 


15.1 


Your ALU can add its two input registers, and it can logically complement the bits of 
either input register, but it cannot subtract. Numbers are to be stored in two’s com- 
plement representation. List the micro-operations your control unit must perform to 
cause a subtraction. 

Show the micro-operations and control signals in the same fashion as Table 15.1 for 
the processor in Figure 15.5 for the following instructions: 

Load Accumulator 

Store Accumulator 

Add to Accumulator 

AND to Accumulator 

Jump 

Jump if AC = 0 

e Complement Accumulator 

Assume that propagation delay along the bus and through the ALU of Figure 15.6 are 
20 and 100 ns, respectively. The time required for a register to copy data from the bus 
is 10 ns. What is the time that must be allowed for 

a. transferring data from one register to another? 

b. incrementing the program counter? 

Write the sequence of micro-operations required for the bus structure of Figure 15.6 
to add a number to the AC when the number is 

a. animmediate operand 

b. a direct-address operand 

c. an indirect-address operand 

A stack is implemented as shown in Figure 10.14. Show the sequence of micro- 
operations for 

a. popping 

b. pushing the stack 
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KEY POINTS 


® An alternative to a hardwired control unit is a microprogrammed control 
unit, in which the logic of the control unit is specified by a microprogram. A 
microprogram consists of a sequence of instructions in a microprogramming 
language. These are very simple instructions that specify micro-operations. 

® A microprogrammed control unit is a relatively simple logic circuit that is 
capable of (1) sequencing through microinstructions and (2) generating 
control signals to execute each microinstruction. 

® Asina hardwired control unit, the control signals generated by a microin- 
struction are used to cause register transfers and ALU operations. 


The term microprogram was first coined by M. V. Wilkes in the early 1950s [WILKS1]. 
Wilkes proposed an approach to control unit design that was organized and systematic 
and avoided the complexities of a hardwired implementation. The idea intrigued many 
researchers but appeared unworkable because it would require a fast, relatively inex- 
pensive control memory. 

The state of the microprogramming art was reviewed by Datamation in its 
February 1964 issue. No microprogrammed system was in wide use at that time, and 
one of the papers [HILL64] summarized the then-popular view that the future of 
microprogramming “is somewhat cloudy. None of the major manufacturers has ev- 
idenced interest in the technique, although presumably all have examined it.” 

This situation changed dramatically within a very few months. IBM’s System/360 
was announced in April, and all but the largest models were microprogrammed. Al- 
though the 360 series predated the availability of semiconducter ROM, the advantages 
of microprogramming were compelling enough for IBM to make this move. Micropro- 
gramming became a popular technique for implementing the control unit of CISC 
processors. In recent years, microprogramming has become less used but remains a tool 
available to computer designers. For example, as we have seen, on the Pentium 4, ma- 
chine instructions are converted into a RISC-like format most of which are executed 
without the use of microprogramming. However, some of the instructions are executed 
using microprogramming. 


16.1 BASIC CONCEPTS 


Microinstructions 


The control unit seems a reasonably simple device. Nevertheless, to implement a 
control unit as an interconnection of basic logic elements is no easy task. The design 
must include logic for sequencing through micro-operations, for executing micro- 
operations, for interpreting opcodes, and for making decisions based on ALU flags. 
It is difficult to design and test such a piece of hardware. Furthermore, the design is 
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Table 16.1 Machine Instruction Set for Wilkes Example 


Effect of Order 


C(Acc) + C(n) to Acc, 

C(Acc) — C(n) to Acc; 

C(n) to Acc, 

C(Acc) X C(n) to Acc, where C(n) = 0 
C(Acc,) to n,0 to Acc 


C(Acc;) ton 

C(Acc) X 2~* to Acc 

C(Acc) X 2"*! to Acc 

IF C(Acc) < 0, transfer control to n;if C(Acc) = 0, ignore (i-e., proceed serially) 


Read next character on input mechanism into 1 


Send C(n) to output mechanism 


Notation: Acc = accumulator 
Acc, = most significant half of accumulator 
Accy = least significant half of accumulator 
n = storage location n 
C(X) = contents of X(X = register or storage location) 


relatively inflexible. For example, it is difficult to change the design if one wishes to 
add a new machine instruction. 

An alternative, which has been used in many CISC processors, is to implement 
a microprogrammed control unit. 

Consider Table 16.1. In addition to the use of control signals, each micro-oper- 
ation is described in symbolic notation. This notation looks suspiciously like a pro- 
gramming language. In fact it is a language, known as a microprogramming language. 
Each line describes a set of micro-operations occurring at one time and is known as a 
microinstruction. A sequence of instructions is known as a microprogram, or 
firmware. This latter term reflects the fact that a microprogram is midway between 
hardware and software. It is easier to design in firmware than hardware, but it is more 
difficult to write a firmware program than a software program. 

How can we use the concept of microprogramming to implement a control 
unit? Consider that for each micro-operation, all that the control unit is allowed to 
do is generate a set of control signals. Thus, for any micro-operation, each control line 
emanating from the control unit is either on or off. This condition can, of course, be 
represented by a binary digit for each control line. So we could construct a control 
word in which each bit represents one control line. Then each micro-operation would 
be represented by a different pattern of 1s and Os in the control word. 

Suppose we string together a sequence of control words to represent the se- 
quence of micro-operations performed by the control unit. Next, we must recognize 
that the sequence of micro-operations is not fixed. Sometimes we have an indirect 
cycle; sometimes we do not. So let us put our control words in a memory, with each 
word having a unique address. Now add an address field to each control word, 
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(b) Vertical microinstruction 


Figure 16.1 Typical Microinstruction Formats 


indicating the location of the next control word to be executed if a certain condition 
is true (e.g., the indirect bit in a memory-reference instruction is 1). Also, add a few 
bits to specify the condition. 

The result is known as a horizontal microinstruction, an example of which is 
shown in Figure 16.1a. The format of the microinstruction or control word is as fol- 
lows. There is one bit for each internal processor control line and one bit for each 
system bus control line. There is a condition field indicating the condition under 
which there should be a branch, and there is a field with the address of the microin- 
struction to be executed next when a branch is taken. Such a microinstruction is in- 
terpreted as follows: 


1. To execute this microinstruction, turn on all the control lines indicated by a 
1 bit; leave off all control lines indicated by a 0 bit. The resulting control signals 
will cause one or more micro-operations to be performed. 


2. If the condition indicated by the condition bits is false, execute the next microin- 
struction in sequence. 


3. If the condition indicated by the condition bits is true, the next microinstruc- 
tion to be executed is indicated in the address field. 


Figure 16.2 shows how these control words or microinstructions could be 
arranged in a control memory. The microinstructions in each routine are to be exe- 
cuted sequentially. Each routine ends with a branch or jump instruction indicating 
where to go next. There is a special execute cycle routine whose only purpose is to 
signify that one of the machine instruction routines (AND, ADD, and so on) is to be 
executed next, depending on the current opcode. 
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Figure 16.2 Organization of Control Memory 


The control memory of Figure 16.2 is a concise description of the complete op- 
eration of the control unit. It defines the sequence of micro-operations to be per- 
formed during each cycle (fetch, indirect, execute, interrupt), and it specifies the 
sequencing of these cycles. If nothing else, this notation would be a useful device for 
documenting the functioning of a control unit for a particular computer. But it is 
more than that. It is also a way of implementing the control unit. 


Microprogrammed Control Unit 


The control memory of Figure 16.2 contains a program that describes the behavior 
of the control unit. It follows that we could implement the control unit by simply ex- 
ecuting that program. 

Figure 16.3 shows the key elements of such an implementation. The set of mi- 
croinstructions is stored in the control memory. The control address register contains 
the address of the next microinstruction to be read. When a microinstruction is read 
from the control memory, it is transferred to a control buffer register. The left-hand 
portion of that register (see Figure 16.1a) connects to the control lines emanating 
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Figure 16.3 Control Unit Microarchitecture 


from the control unit. Thus, reading a microinstruction from the control memory is 
the same as executing that microinstruction. The third element shown in the figure is 
a sequencing unit that loads the control address register and issues a read command. 
Let us examine this structure in greater detail, as depicted in Figure 16.4. Compar- 
ing this with Figure 16.4, we see that the control unit still has the same inputs (IR, ALU 
flags, clock) and outputs (control signals). The control unit functions as follows: 


1. To execute an instruction, the sequencing logic unit issues a READ command 
to the control memory. 


nN 


. The word whose address is specified in the control address register is read into 
the control buffer register. 


3. The content of the control buffer register generates control signals and next- 
address information for the sequencing logic unit. 


4. The sequencing logic unit loads a new address into the control address register 
based on the next-address information from the control buffer register and the 
ALU flags. 


All this happens during one clock pulse. 

The last step just listed needs elaboration. At the conclusion of each microin- 
struction, the sequencing logic unit loads a new address into the control address reg- 
ister. Depending on the value of the ALU flags and the control buffer register, one 
of three decisions is made: 


e Get the next instruction: Add 1 to the control address register. 


e Jump to a new routine based on a jump microinstruction: Load the address 
field of the control buffer register into the control address register. 


e Jump to a machine instruction routine: Load the control address register 
based on the opcode in the IR. 
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Figure 16.4 Functioning of Microprogrammed Control Unit 


Figure 16.4 shows two modules labeled decoder. The upper decoder translates 
the opcode of the IR into a control memory address. The lower decoder is not used 
for horizontal microinstructions but is used for vertical microinstructions (Fig- 
ure 16.1b). As was mentioned, in a horizontal microinstruction every bit in the con- 
trol field attaches to a control line. In a vertical microinstruction, a code is used for 
each action to be performed [e.g., MAR <— (PC) ], and the decoder translates this 
code into individual control signals. The advantage of vertical microinstructions is 
that they are more compact (fewer bits) than horizontal microinstructions, at the ex- 
pense of a small additional amount of logic and time delay. 


Wilkes Control 


As was mentioned, Wilkes first proposed the use of a microprogrammed control 
unit in 1951 [WILKS1]. This proposal was subsequently elaborated into a more de- 
tailed design [WILKS53]. It is instructive to examine this seminal proposal. 

The configuration proposed by Wilkes is depicted in Figure 16.5. The heart of the 
system is a matrix partially filled with diodes. During a machine cycle, one row of the 
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Figure 16.5 Wilkes’s Microprogrammed Control Unit 


matrix is activated with a pulse. This generates signals at those points where a diode is 
present (indicated by a dot in the diagram). The first part of the row generates the con- 
trol signals that control the operation of the processor. The second part generates the 
address of the row to be pulsed in the next machine cycle. Thus, each row of the matrix 
is One microinstruction, and the layout of the matrix is the control memory. 

At the beginning of the cycle, the address of the row to be pulsed is contained 
in Register I. This address is the input to the decoder, which, when activated by a 
clock pulse, activates one row of the matrix. Depending on the control signals, either 
the opcode in the instruction register or the second part of the pulsed row is passed 
into Register II during the cycle. Register II is then gated to Register I by a clock 
pulse. Alternating clock pulses are used to activate a row of the matrix and to trans- 
fer from Register II to Register I. The two-register arrangement is needed because 
the decoder is simply a combinatorial circuit; with only one register, the output 
would become the input during a cycle, causing an unstable condition. 

This scheme is very similar to the horizontal microprogramming approach de- 
scribed earlier (Figure 16.1a). The main difference is this: In the previous descrip- 
tion, the control address register could be incremented by one to get the next 
address. In the Wilkes scheme, the next address is contained in the microinstruction. 
To permit branching, a row must contain two address parts, controlled by a condi- 
tional signal (e.g., flag), as shown in the figure. 

Having proposed this scheme, Wilkes provides an example of its use to imple- 
ment the control unit of a simple machine. This example, the first known design of a 
microprogrammed processor, is worth repeating here because it illustrates many of 
the contemporary principles of microprogramming. 
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The processor of the hypothetical machine includes the following registers: 


A multiplicand 
B_ accumulator (least significant half) 
C accumulator (most significant half) 


D shift register 


In addition, there are three registers and two 1-bit flags accessible only to the con- 
trol unit. The registers are as follows: 


E serves as both a memory address register (MAR) and temporary storage 
F program counter 


G_ another temporary register; used for counting 


Table 16.1 lists the machine instruction set for this example. Table 16.2 is the 
complete set of microinstructions, expressed in symbolic form, that implements 
the control unit. Thus, a total of 38 microinstructions is all that is required to de- 
fine the system completely. 

The first full column gives the address (row number) of each microinstruction. 
Those addresses corresponding to opcodes are labeled. Thus, when the opcode for 
the add instruction (A) is encountered, the microinstruction at location 5 is executed. 
Columns 2 and 3 express the actions to be taken by the ALU and control unit, 
respectively. Each symbolic expression must be translated into a set of control sig- 
nals (microinstruction bits). Columns 4 and 5 have to do with the setting and use of 
the two flags (flip-flops). Column 4 specifies the signal that sets the flag. For exam- 
ple, (1)C, means that flag number 1 is set by the sign bit of the number in register C. 
If column 5 contains a flag identifier, then columns 6 and 7 contain the two alterna- 
tive microinstruction addresses to be used. Otherwise, column 6 specifies the ad- 
dress of the next microinstruction to be fetched. 

Instructions 0 through 4 constitute the fetch cycle. Microinstruction 4 presents 
the opcode to a decoder, which generates the address of a microinstruction corre- 
sponding to the machine instruction to be fetched. The reader should be able to de- 
duce the complete functioning of the control unit from a careful study of Table 16.2. 


Advantages and Disadvantages 


The principal advantage of the use of microprogramming to implement a control 
unit is that it simplifies the design of the control unit. Thus, it is both cheaper and 
less error prone to implement. A hardwired contro] unit must contain complex logic 
for sequencing through the many micro-operations of the instruction cycle. On the 
other hand, the decoders and sequencing logic unit of a microprogrammed control 
unit are very simple pieces of logic. 

The principal disadvantage of a microprogrammed unit is that it will be 
somewhat slower than a hardwired unit of comparable technology. Despite this, 
microprogramming is the dominant technique for implementing control units in 
pure CISC architectures, due to its ease of implementation. RISC processors, with 
their simpler instruction format, typically use hardwired control units. We now ex- 
amine the microprogrammed approach in greater detail. 
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Microinstructions for Wilkes Example 


Notation: A, B, C, ... stand for the various registers in the arithmetical and control register units. C to D 
indicates that the switching circuits connect the output of register C to the input register D; (D + A) toC 
indicates that the output register of A is connected to the one input of the adding unit (the output of D is 
permanently connected to the other input), and the output of the adder to register C. A numerical symbol n 
in quotes (e.g., ‘n’) stands for the source whose output is the number v in units of the least significant digit. 


Next Micro- 
Control Conditional Flip-Flop instruction 
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F toGand E 
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BtoD EtoG 
D to B(R) (G — ‘l’)toE 
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(Continued) 
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Table 16.2 Continued 


Next Micro- 
Control Conditional Flip-Flop instruction 


Arithmetical Unit Register Unit Set Use 0 1 


(D + A) toC 2 33 
BtoD (1)B, 
D to B(R) 
C to D(R) 
DtoC 
(D — A) toC 


*Right shift. The switching circuits in the arithmetic unit are arranged so that the least significant digit of the reg- 
ister C is placed in the most significant place of register B during right shift micro-operations, and the most signif- 
icant digit of register C (sign digit) is repeated (thus making the correction for negative numbers). 


‘Left shift. The switching circuits are similarly arranged to pass the most significant digit of register B to the least 
significant place of register C during left shift micro-operations. 


16.2 MICROINSTRUCTION SEQUENCING 


The two basic tasks performed by a microprogrammed control unit are as follows: 


e Microinstruction sequencing: Get the next microinstruction from the control 
memory. 


e Microinstruction execution: Generate the control signals needed to execute 
the microinstruction. 


In designing a control unit, these tasks must be considered together, because 
both affect the format of the microinstruction and the timing of the control unit. In 
this section, we will focus on sequencing and say as little as possible about format 
and timing issues. These issues are examined in more detail in the next section. 


Design Considerations 


Two concerns are involved in the design of a microinstruction sequencing technique: 
the size of the microinstruction and the address-generation time. The first concern is 
obvious; minimizing the size of the control memory reduces the cost of that component. 
The second concern is simply a desire to execute microinstructions as fast as possible. 

In executing a microprogram, the address of the next microinstruction to be 
executed is in one of these categories: 


e Determined by instruction register 

e Next sequential address 

e Branch 
The first category occurs only once per instruction cycle, just after an instruction is 
fetched. The second category is the most common in most designs. However, the de- 


sign cannot be optimized just for sequential access. Branches, both conditional and 
unconditional, are a necessary part of a microprogram. Furthermore, microinstruction 
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sequences tend to be short; one out of every three or four microinstructions could 
be a branch [SIEW82]. Thus, it is important to design compact, time-efficient tech- 
niques for microinstruction branching. 


Sequencing Techniques 


Based on the current microinstruction, condition flags, and the contents of the in- 
struction register, a control memory address must be generated for the next mi- 
croinstruction. A wide variety of techniques have been used. We can group them 
into three general categories, as illustrated in Figures 16.6 to 16.8. These categories 
are based on the format of the address information in the microinstruction: 


e Two address fields 
e Single address field 
e Variable format 
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Figure 16.6 Branch Control Logic: Two Address Fields 
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Figure 16.7 Branch Control Logic: Single Address Field 


The simplest approach is to provide two address fields in each microinstruc- 
tion. Figure 16.6 suggests how this information is to be used. A multiplexer is pro- 
vided that serves as a destination for both address fields plus the instruction register. 
Based on an address-selection input, the multiplexer transmits either the opcode or 
one of the two addresses to the control address register (CAR). The CAR is subse- 
quently decoded to produce the next microinstruction address. The address-selection 
signals are provided by a branch logic module whose input consists of control unit 
flags plus bits from the control portion of the microinstruction. 

Although the two-address approach is simple, it requires more bits in the 
microinstruction than other approaches. With some additional logic, savings can be 
achieved. A common approach is to have a single address field (Figure 16.7). With 
this approach, the options for next address are as follows: 


e Address field 
e Instruction register code 
e Next sequential address 
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Figure 16.8 Branch Control Logic: Variable Format 


The address-selection signals determine which option is selected. This approach 
reduces the number of address fields to one. Note, however, that the address field 
often will not be used. Thus, there is some inefficiency in the microinstruction cod- 
ing scheme. 

Another approach is to provide for two entirely different microinstruction for- 
mats (Figure 16.8). One bit designates which format is being used. In one format, the 
remaining bits are used to activate control signals. In the other format, some bits 
drive the branch logic module, and the remaining bits provide the address. With the 
first format, the next address is either the next sequential address or an address de- 
rived from the instruction register. With the second format, either a conditional or 
unconditional branch is being specified. One disadvantage of this approach is that 
one entire cycle is consumed with each branch microinstruction. With the other ap- 
proaches, address generation occurs as part of the same cycle as control signal gen- 
eration, minimizing control memory accesses. 
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Table 16.3. Microinstruction Address Generation 
Techniques 


Explicit Implicit 


Two-field Mapping 
Unconditional branch Addition 


Conditional branch Residual control 


The approaches just described are general. Specific implementations will often 
involve a variation or combination of these techniques. 


Address Generation 


We have looked at the sequencing problem from the point of view of format consid- 
erations and general logic requirements. Another viewpoint is to consider the vari- 
ous ways in which the next address can be derived or computed. 

Table 16.3 lists the various address generation techniques. These can be 
divided into explicit techniques, in which the address is explicitly available in the 
microinstruction, and implicit techniques, which require additional logic to gener- 
ate the address. 

We have essentially dealt with the explicit techniques. With a two-field ap- 
proach, two alternative addresses are available with each microinstruction. Using 
either a single address field or a variable format, various branch instructions can 
be implemented. A conditional branch instruction depends on the following types 
of information: 


e ALU flags 

e Part of the opcode or address mode fields of the machine instruction 
e Parts of a selected register, such as the sign bit 

e Status bits within the control unit 


Several implicit techniques are also commonly used. One of these, mapping, is re- 
quired with virtually all designs. The opcode portion of a machine instruction must be 
mapped into a microinstruction address. This occurs only once per instruction cycle. 

A common implicit technique is one that involves combining or adding two 
portions of an address to form the complete address. This approach was taken for 
the IBM S/360 family [TUCK67] and used on many of the S/370 models. We will use 
the IBM 3033 as an example. 

The control address register on the IBM 3033 is 13 bits long and is illustrated 
in Figure 16.9. Two parts of the address can be distinguished. The highest-order 
8 bits (00-07) normally do not change from one microinstruction cycle to the next. 
During the execution of a microinstruction, these 8 bits are copied directly from an 
8-bit field of the microinstruction (the BA field) into the highest-order 8 bits of the 
control address register. This defines a block of 32 microinstructions in control 
memory. The remaining 5 bits of the control address register are set to specify the 
specific address of the microinstruction to be fetched next. Each of these bits is 
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Figure 16.9 IBM 3033 Control Address Register 


determined by a 4-bit field (except one is a 7-bit field) in the current microinstruc- 
tion; the field specifies the condition for setting the corresponding bit. For example, 
a bit in the control address register might be set to 1 or 0 depending on whether a 
carry occurred on the last ALU operation. 

The final approach listed in Table 16.3 is termed residual control. This approach 
involves the use of a microinstruction address that has previously been saved in 
temporary storage within the control unit. For example, some microinstruction sets 
come equipped with a subroutine facility. An internal register or stack of registers is 
used to hold return addresses. An example of this approach is taken on the LSI-11, 
which we now examine. 


LSI-11 Microinstruction Sequencing 


The LSI-11 is a microcomputer version of a PDP-11, with the main components of 
the system residing on a single board. The LSI-11 is implemented using a micropro- 
grammed control unit [SEBE76]. 

The LSI-11 makes use of a 22-bit microinstruction and a control memory 
of 2K 22-bit words. The next microinstruction address is determined in one of 
five ways: 


e Next sequential address: In the absence of other instructions, the control unit’s 
control address register is incremented by 1. 


¢ Opcode mapping: At the beginning of each instruction cycle, the next microin- 
struction address is determined by the opcode. 


e Subroutine facility: Explained presently. 


e Interrupt testing: Certain microinstructions specify a test for interrupts. If an 
interrupt has occurred, this determines the next microinstruction address. 


e Branch: Conditional and unconditional branch microinstructions are used. 


A one-level subroutine facility is provided. One bit in every microinstruction 
is dedicated to this task. When the bit is set, an 11-bit return register is loaded with 
the updated contents of the control address register. A subsequent microinstruction 
that specifies a return will cause the control address register to be loaded from the 
return register. 

The return is one form of unconditional branch instruction. Another form of 
unconditional branch causes the bits of the control address register to be loaded 
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from 11 bits of the microinstruction. The conditional branch instruction makes use 
of a 4-bit test code within the microinstruction. This code specifies testing of various 
ALU condition codes to determine the branch decision. If the condition is not true, 
the next sequential address is selected. If it is true, the 8 lowest-order bits of the con- 
trol address register are loaded from 8 bits of the microinstruction. This allows 
branching within a 256-word page of memory. 

As can be seen, the LSI-11 includes a powerful address sequencing facility 
within the control unit. This allows the microprogrammer considerable flexibility 
and can ease the microprogramming task. On the other hand, this approach requires 
more control unit logic than do simpler capabilities. 


16.3 MICROINSTRUCTION EXECUTION 


The microinstruction cycle is the basic event on a microprogrammed processor. 
Each cycle is made up of two parts: fetch and execute. The fetch portion is deter- 
mined by the generation of a microinstruction address, and this was dealt with in the 
preceding section. This section deals with the execution of a microinstruction. 

Recall that the effect of the execution of a microinstruction is to generate con- 
trol signals. Some of these signals control points internal to the processor. The re- 
maining signals go to the external control bus or other external interface. As an 
incidental function, the address of the next microinstruction is determined. 

The preceding description suggests the organization of a control unit shown in 
Figure 16.10. This slightly revised version of Figure 16.4 emphasizes the focus of this 
section. The major modules in this diagram should by now be clear. The sequencing 
logic module contains the logic to perform the functions discussed in the preceding 
section. It generates the address of the next microinstruction, using as inputs the in- 
struction register, ALU flags, the control address register (for incrementing), and 
the control buffer register. The last may provide an actual address, control bits, or 
both. The module is driven by a clock that determines the timing of the microin- 
struction cycle. 

The control logic module generates control signals as a function of some of the 
bits in the microinstruction. It should be clear that the format and content of the mi- 
croinstruction will determine the complexity of the control logic module. 


A Taxonomy of Microinstructions 


Microinstructions can be classified in a variety of ways. Distinctions that are com- 
monly made in the literature include the following: 

e Vertical/horizontal 

e Packed/unpacked 

e Hard/soft microprogramming 

e Direct/indirect encoding 


All of these bear on the format of the microinstruction. None of these terms has 
been used in a consistent, precise way in the literature. However, an examination 
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Figure 16.10 Control Unit Organization 


of these pairs of qualities serves to illuminate microinstruction design alterna- 
tives. In the following paragraphs, we first look at the key design issue underlying 
all of these pairs of characteristics, and then we look at the concepts suggested by 
each pair. 

In the original proposal by Wilkes [WILK51], each bit of a microinstruction ei- 
ther directly produced a control signal or directly produced one bit of the next ad- 
dress. We have seen, in the preceding section, that more complex address sequencing 
schemes, using fewer microinstruction bits, are possible. These schemes require a 
more complex sequencing logic module. A similar sort of trade-off exists for the por- 
tion of the microinstruction concerned with control signals. By encoding control in- 
formation, and subsequently decoding it to produce control signals, control word bits 
can be saved. 

How can this encoding be done? To answer that, consider that there are a 
total of K different internal and external control signals to be driven by the con- 
trol unit. In Wilkes’s scheme, K bits of the microinstruction would be dedicated 
to this purpose. This allows all of the 2“ possible combinations of control signals 
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to be generated during any instruction cycle. But we can do better than this if we 
observe that not all of the possible combinations will be used. Examples include 
the following: 


e Two sources cannot be gated to the same destination (e.g., C, and Cg in 
Figure 16.5). 


e A register cannot be both source and destination (e.g.,C; and Cy, in Figure 16.5). 
e Only one pattern of control signals can be presented to the ALU at a time. 


e Only one pattern of control signals can be presented to the external control 
bus at a time. 


So, for a given processor, all possible allowable combinations of control signals 
could be listed, giving some number Q < 2* possibilities. These could be encoded 
with log, Q bits, with (log, Q) < K. This would be the tightest possible form of en- 
coding that preserves all allowable combinations of control signals. In practice, this 
form of encoding is not used, for two reasons: 


e It is as difficult to program as a pure decoded (Wilkes) scheme. This point is 
discussed further presently. 


e It requires a complex and therefore slow control logic module. 
Instead, some compromises are made. These are of two kinds: 


e More bits than are strictly necessary are used to encode the possible combi- 
nations. 


e Some combinations that are physically allowable are not possible to encode. 


The latter kind of compromise has the effect of reducing the number of bits. The net 
result, however, is to use more than log, Q bits. 

In the next subsection, we will discuss specific encoding techniques. The re- 
mainder of this subsection deals with the effects of encoding and the various terms 
used to describe it. 

Based on the preceding, we can see that the control signal portion of the mi- 
croinstruction format falls on a spectrum. At one extreme, there is one bit for each 
control signal; at the other extreme, a highly encoded format is used. Table 16.4 
shows that other characteristics of a microprogrammed control unit also fall along a 
spectrum and that these spectra are, by and large, determined by the degree-of- 
encoding spectrum. 

The second pair of items in the table is rather obvious. The pure Wilkes 
scheme will require the most bits. It should also be apparent that this extreme pre- 
sents the most detailed view of the hardware. Every control signal is individually 
controllable by the microprogrammer. Encoding is done in such a way as to aggre- 
gate functions or resources, so that the microprogrammer is viewing the processor at 
a higher, less detailed level. Furthermore, the encoding is designed to ease the mi- 
croprogramming burden. Again, it should be clear that the task of understanding 
and orchestrating the use of all the control signals is a difficult one. As was men- 
tioned, one of the consequences of encoding, typically, is to prevent the use of certain 
otherwise allowable combinations. 
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Table 16.4 The Microinstruction Spectrum 


Characteristics 


Unencoded Highly encoded 

Many bits Few bits 

Detailed view of hardware Aggregated view of hardware 
Difficult to program Easy to program 


Concurrency fully exploited Concurrency not fully exploited 


Little or no control logic Complex control logic 


Fast execution Slow execution 


Optimize performance Optimize programming 


Terminology 


Unpacked Packed 
Horizontal Vertical 
Hard Soft 


The preceding paragraph discusses microinstruction design from the micro- 
programmer’s point of view. But the degree of encoding also can be viewed from its 
hardware effects. With a pure unencoded format, little or no decode logic is needed; 
each bit generates a particular control signal. As more compact and more aggregated 
encoding schemes are used, more complex decode logic is needed. This, in turn, may 
affect performance. More time is needed to propagate signals through the gates of 
the more complex control logic module. Thus, the execution of encoded microin- 
structions takes longer than the execution of unencoded ones. 

Thus, all of the characteristics listed in Table 16.4 fall along a spectrum of de- 
sign strategies. In general, a design that falls toward the left end of the spectrum is 
intended to optimize the performance of the control unit. Designs toward the 
right end are more concerned with optimizing the process of microprogramming. 
Indeed, microinstruction sets near the right end of the spectrum look very much 
like machine instruction sets. A good example of this is the LSI-11 design, de- 
scribed later in this section. Typically, when the objective is simply to implement a 
control unit, the design will be near the left end of the spectrum. The IBM 3033 de- 
sign, discussed presently, is in this category. As we shall discuss later, some systems 
permit a variety of users to construct different microprograms using the same mi- 
croinstruction facility. In the latter cases, the design is likely to fall near the right 
end of the spectrum. 

We can now deal with some of the terminology introduced earlier. Table 16.4 
indicates how three of these pairs of terms relate to the microinstruction spectrum. 
In essence, all of these pairs describe the same thing but emphasize different design 
characteristics. 

The degree of packing relates to the degree of identification between a given 
control task and specific microinstruction bits. As the bits become more packed, a 
given number of bits contains more information. Thus, packing connotes encoding. 
The terms horizontal and vertical relate to the relative width of microinstructions. 
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[SIEW872] suggests as a rule of thumb that vertical microinstructions have lengths in 
the range of 16 to 40 bits and that horizontal microinstructions have lengths in the 
range of 40 to 100 bits. The terms hard and soft microprogramming are used to sug- 
gest the degree of closeness to the underlying control signals and hardware layout. 
Hard microprograms are generally fixed and committed to read-only memory. Soft 
microprograms are more changeable and are suggestive of user microprogramming. 

The other pair of terms mentioned at the beginning of this subsection refers to 
direct versus indirect encoding, a subject to which we now turn. 


Microinstruction Encoding 


In practice, microprogrammed control units are not designed using a pure unencod- 
ed or horizontal microinstruction format. At least some degree of encoding is used to 
reduce control memory width and to simplify the task of microprogramming. 

The basic technique for encoding is illustrated in Figure 16.11a. The microin- 
struction is organized as a set of fields. Each field contains a code, which, upon de- 
coding, activates one or more control signals. 
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(a) Direct encoding 
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Control signals 


(b) Indirect encoding 


Figure 16.11 Microinstruction Encoding 
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Let us consider the implications of this layout. When the microinstruction is 
executed, every field is decoded and generates control signals. Thus, with N fields, 
N simultaneous actions are specified. Each action results in the activation of one 
or more control signals. Generally, but not always, we will want to design the 
format so that each control signal is activated by no more than one field. Clearly, 
however, it must be possible for each control signal to be activated by at least 
one field. 

Now consider the individual field. A field consisting of L bits can contain one 
of 2" codes, each of which can be encoded to a different control signal pattern. 
Because only one code can appear in a field at a time, the codes are mutually exclu- 
sive, and, therefore, the actions they cause are mutually exclusive. 

The design of an encoded microinstruction format can now be stated in 
simple terms: 


e Organize the format into independent fields. That is, each field depicts a set of 
actions (pattern of control signals) such that actions from different fields can 
occur simultaneously. 


e Define each field such that the alternative actions that can be specified by the 
field are mutually exclusive. That is, only one of the actions specified for a 
given field could occur at a time. 


Two approaches can be taken to organizing the encoded microinstruction into 
fields: functional and resource. The functional encoding method identifies functions 
within the machine and designates fields by function type. For example, if various 
sources can be used for transferring data to the accumulator, one field can be desig- 
nated for this purpose, with each code specifying a different source. Resource encod- 
ing views the machine as consisting of a set of independent resources and devotes 
one field to each (e.g., I/O, memory, ALU). 

Another aspect of encoding is whether it is direct or indirect (Figure 16.11b). 
With indirect encoding, one field is used to determine the interpretation of another 
field. For example, consider an ALU that is capable of performing eight different 
arithmetic operations and eight different shift operations. A 1-bit field could be used 
to indicate whether a shift or arithmetic operation is to be used; a 3-bit field would 
indicate the operation. This technique generally implies two levels of decoding, in- 
creasing propagation delays. 

Figure 16.12 is a simple example of these concepts. Assume a processor with 
a single accumulator and several internal registers, such as a program counter and 
a temporary register for ALU input. Figure 16.12a shows a highly vertical format. 
The first 3 bits indicate the type of operation, the next 3 encode the operation, 
and the final 2 select an internal register. Figure 16.12b is a more horizontal ap- 
proach, although encoding is still used. In this case, different functions appear in 
different fields. 


LSI-11 Microinstruction Execution 


The LSI-11 [SEBE76] is a good example of a vertical microinstruction approach. 
We look first at the organization of the control unit, then at the microinstruc- 
tion format. 


608 CHAPTER 16 / MICROPROGRAMMED CONTROL 


Simple register transfers 


0,0, 0] 0 fr 0,0 MDR < Register 
0,0, 0/0 r 0,1 Register — MDR 
0,0, 0] 0 fr 1,0 MAR €¢ Register 
Lys 
Register 
Memory operations Bole 
0,0,1/0,0,0 Read 
0,0,1)0 r 0,1 Write 
Special sequencing operations 
0,1, 0) 0 r 0,0 CSAR < Decoded MDR 
0,1,0;0 fr 0,1 CSAR < Constant (in next byte) 
0,1,0 0) 1,0 Skip 
ALU operations 
0,1,1)0,0,0 ACC < ACC + Register 
0,1,1)]0 r 0,1 ACC < ACC — Register 
0,1,1)0 r 1,0 ACC < Register 
0,1,1)0 r 1,1 Register — ACC 
0,1,1)1 fr 0,0 ACC < Register + 1 
Lys 
Register 
select 


(a) Vertical microinstruction format 


012 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 


Field 1 2 3 4 5 6 


Field definition 
1—register transfer 4—ALU operation 
2—memory operation 5— register selection 
3—sequencing operation 6—Constant 


(b) Horizontal microinstruction format 


Figure 16.12 Alternative Microinstruction Formats for a 
Simple Machine 
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LSI-11 CONTROL UNIT ORGANIZATION The LSI-11 is the first member of the 
PDP-11 family that was offered as a single-board processor. The board contains 
three LSI chips, an internal bus known as the microinstruction bus (MIB), and some 
additional interfacing logic. 

Figure 16.13 depicts, in simplified form, the organization of the LSI-11 
processor. The three chips are the data, control, and control store chips. The data 
chip contains an 8-bit ALU, twenty-six 8-bit registers, and storage for several con- 
dition codes. Sixteen of the registers are used to implement the eight 16-bit 
general-purpose registers of the PDP-11. Others include a program status word, 
memory address register (MAR), and memory buffer register. Because the ALU 
deals with only 8 bits at a time, two passes through the ALU are required 
to implement a 16-bit PDP-11 arithmetic operation. This is controlled by the 
microprogram. 

The control store chip or chips contain the 22-bit-wide control memory. The 
control chip contains the logic for sequencing and executing microinstructions. It 
contains the control address register, the control data register, and a copy of the ma- 
chine instruction register. 


Control 
store 


Microinstruction 
bus 


Control Data 
chip chip 
4 
16 


Bus control 
and other 
processor 

board logic 


Bus logic 


LSI-11 system 
bus 


Z With no number indicated, a path 
with multiple signals 


Figure 16.13 Simplified Block Diagram of the LSI-11 Processor 
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Control data register 


Control 
store 


Control address register 


Microprogram 


sequence 
Return register oa 4 


control 


Translation 
array 


Instruction register 


INT 


Figure 16.14 Organization of the LSI-11 Control Unit 


The MIB ties all the components together. During microinstruction fetch, the 
control chip generates an 11-bit address onto the MIB. Control store is accessed, 
producing a 22-bit microinstruction, which is placed on the MIB. The low-order 
16 bits go to the data chip, while the low-order 18 bits go to the control chip. The 
high-order 4 bits control special processor board functions. 

Figure 16.14 provides a still simplified but more detailed look at the LSI-11 
control unit: The figure ignores individual chip boundaries. The address sequencing 
scheme described in Section 16.2 is implemented in two modules. Overall sequence 
control is provided by the microprogram sequence control module, which is capable 
of incrementing the microinstruction address register and performing uncondition- 
al branches. The other forms of address calculation are carried out by a separate 
translation array. This is a combinatorial circuit that generates an address based on 
the microinstruction, the machine instruction, the microinstruction program counter, 
and an interrupt register. 
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The translation array comes into play on the following occasions: 


e The opcode is used to determine the start of a microroutine. 


e At appropriate times, address mode bits of the microinstruction are tested to 
perform appropriate addressing. 


e Interrupt conditions are periodically tested. 


e Conditional branch microinstructions are evaluated. 


LSI-11 MICROINSTRUCTION FORMAT The LSI-11 uses an extremely vertical mi- 
croinstruction format, which is only 22 bits wide. The microinstruction set strongly re- 
sembles the PDP-11 machine instruction set that it implements. This design was 
intended to optimize the performance of the control unit within the constraint of a ver- 
tical, easily programmed design. Table 16.5 lists some of the LSI-11 microinstructions. 

Figure 16.15 shows the 22-bit LSI-11 microinstruction format. The high-order 
4 bits control special functions on the processor board. The translate bit enables the 
translation array to check for pending interrupts. The load return register bit is used 
at the end of a microroutine to cause the next microinstruction address to be loaded 
from the return register. 


Table 16.5 Some LSI-11 Microinstructions 


Arithmetic Operations 


Add word (byte, literal) 

Test word (byte, literal) 
Increment word (byte) by 1 
Increment word (byte) by 2 
Negate word (byte) 
Conditionally increment (decrement) byte 
Conditionally add word (byte) 
Add word (byte) with carry 
Conditionally add digits 
Subtract word (byte) 

Compare word (byte, literal) 
Subtract word (byte) with carry 
Decrement word (byte) by 1 


Logical Operations 


AND word (byte, literal) 
Test word (byte) 

OR word (byte) 
Exclusive-OR word (byte) 
Bit clear word (byte) 


Shift word (byte) right (left) with (without) carry 


Complement word (byte) 


General Operations 


MOV word (byte) 

Jump 

Return 

Conditional jump 

Set (reset) flags 

Load G low 

Conditionally MOV word (byte) 


Input/Output Operations 
Input word (byte) 
Input status word (byte) 
Read 
Write 
Read (write) and increment word (byte) by 1 
Read (write) and increment word (byte) by 2 
Read (write) acknowledge 


Output word (byte, status) 
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4 1 1 16 
ral sal Encoded micro-operations 
‘unctions 
Load return register 
Translate 


(a) Format of the full LSI-11 microinstruction 


5 11 


Opcode Jump address 


Unconditional jump microinstruction format 


4 4 8 


Opcode Test code Jump address 


Conditional jump microinstruction format 


4 8 4 


Opcode Lteral value A register 


Literal microinstruction format 


8 4 4 


Opcode B register | A register 


Register jump microinstruction format 
(b) Format of the encoded part of the LSI-11 microinstruction 


Figure 16.15 LSI-11 Microinstruction Format 


The remaining 16 bits are used for highly encoded micro-operations. The 
format is much like a machine instruction, with a variable-length opcode and one or 
more operands. 


IBM 3033 Microinstruction Execution 


The standard IBM 3033 control memory consists of 4K words. The first half of these 
(0000-07FF) contain 108-bit microinstructions, while the remainder (0800-OFFF) 
are used to store 126-bit microinstructions. The format is depicted in Figure 16.16. 
Although this is a rather horizontal format, encoding is still extensively used. The 
key fields of that format are summarized in Table 16.6. 

The ALU operates on inputs from four dedicated, non-user-visible registers, A, 
B, C, and D. The microinstruction format contains fields for loading these registers 
from user-visible registers, performing an ALU function, and specifying a user-visible 
register for storing the result. There are also fields for loading and storing data be- 
tween registers and memory. 

The sequencing mechanism for the IBM 3033 was discussed in Section 16.2. 


£19 


35 


0 
P AA AB AC AD AE AF AG AH AJ AK AL 
~< aa ><. >| 
A, B, C, D registers Arithmetic Shift 
36 71 
|~< |e >| 
Next address Storage address 
72 107 
P| BH CA CB cc CD CE CF CG CH 
~« >|~< >|~< >|~< >| 
Shift control Miscellaneous controls 
Storage address Local storage 
108 125 
P| DA DB DC DD DE 
< >} 
Testing and condition code setting 


Figure 16.16 


IBM 3033 Microinstruction Format 
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Table 16.6 IBM 3033 Microinstruction Control Fields 


ALU Control Fields 


Load A register from one of data registers 
Load B register from one of data registers 
Load C register from one of data registers 
Load D register from one of data registers 
Route specified A bits to ALU 

Route specified B bits to ALU 

Specifies ALU arithmetic operation on A input 
Specifies ALU arithmetic operation on B input 
Specifies D or B input to ALU on B side 
Route arithmetic output to shifter 

Load F register 

Activate shifter 

Specifies logical and carry functions 


Specifies shift amount 


Sequencing and Branching Fields 


End operation and perform branch 

Set high-order bits (00-07) of control address register 
Specifies condition for setting bit 8 of control address register 
Specifies condition for setting bit 9 of control address register 
Specifies condition for setting bit 10 of control address register 


Specifies condition for setting bit 11 of control address register 


Specifies condition for setting bit 12 of control address register 


16.4 TI 8800 


The Texas Instruments 8800 Software Development Board (SDB) is a micropro- 
grammable 32-bit computer card. The system has a writable control store, imple- 
mented in RAM rather than ROM. Such a system does not achieve the speed or 
density of a microprogrammed system with a ROM control store. However, it is use- 
ful for developing prototypes and for educational purposes. 

The 8800 SDB consists of the following components (Figure 16.17): 


e Microcode memory 

e Microsequencer 

e 32-bit ALU 

e Floating-point and integer processor 
e Local data memory 


Two buses link the internal components of the system. The DA bus provides 
data from the microinstruction data field to the ALU, the floating-point processor, 
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Figure 16.17 TI8800 Block Diagram 


or the microsequencer. In the latter case, the data consists of an address to be used 
for a branch instruction. The bus can also be used for the ALU or microsequencer to 
provide data to other components. The System Y bus connects the ALU and floating- 
point processor to local memory and to external modules via the PC interface. 

The board fits into an IBM PC-compatible host computer. The host computer 
provides a suitable platform for microcode assembly and debug. 


Microinstruction Format 


The microinstruction format for the 8800 consists of 128 bits broken down into 
30 functional fields, as indicated in Table 16.7. Each field consists of one or more 
bits, and the fields are grouped into five major categories: 

e Control of board 


¢ 8847 floating-point and integer processor chip 


TI 8800 Microinstruction Format 


Field Number 
Number of Bits Description 


Control of Board 


Select condition code input 

Enable/disable external I/O request signal 
Enable/disable local data memory read/write operations 
Load status/do no load status 

Determine unit driving Y bus 


Determine unit driving DA bus 


8847 Floating Point and Integer Processing Chip 


C register control: clock, do not clock 

Select most significant or least significant bits for Y bus 
C register data source: ALU, multiplexer 

Select IEEE or FAST mode for ALU and MUL 


Select sources for data operands: RA registers, RB registers, P register, 5 register, 
C register 


RB register control: clock, do not clock 
RA register control: clock, do not clock 
Data source confirmation 
Enable/disable pipeline registers 

8847 ALU function 


8832 Registered ALU 


Write enable/disable data output to selected register: most significant half, least 
significant half 


Select register file data source: DA bus, DB bus, ALU Y MUX output, system Y bus 
Shift instruction modifier 

Carry in: force, do not force 

Set ALU configuration mode: 32, 16, or 8 bits 

Select input to 5 multiplexer: register file, DB bus, MO register 

Select input to R multiplexer: register file, DA bus 

Select register in file C for write 

Select register in file B for read 

Select register in file A for write 

ALU function 


ODNDADA A FPN YN FP WN 


8818 Microsequencer 


Control input signals to the 8818 
WCS Data Field 


Most significant bits of writable control store data field 


Least significant bits of writable control store data field 
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e 8832 registered ALU 
e 8818 microsequencer 
e WCS data field 


As indicated in Figure 16.17, the 32 bits of the WCS data field are fed into the DA 
bus to be provided as data to the ALU, floating-point processor, or microsequencer. 
The other 96 bits (fields 1-27) of the microinstruction are control signals that are fed 
directly to the appropriate module. For simplicity, these other connections are not 
shown in Figure 16.17. 

The first six fields deal with operations that pertain to the control of the 
board, rather than controlling an individual component. Control operations in- 
clude the following: 


e Selecting condition codes for sequencer control. The first bit of field 1 indi- 
cates whether the condition flag is to be set to 1 or 0, and the remaining 4 bits 
indicate which flag is to be set. 

e Sending an I/O request to the PC/AT. 


Enabling local data memory read/write operations. 


e Determining the unit driving the system Y bus. One of the four devices at- 
tached to the bus (Figure 16.17) is selected. 


The last 32 bits are the data field, which contain information specific to a par- 
ticular microinstruction. 

The remaining fields of the microinstruction are best discussed in the context 
of the device that they control. In the remainder of this section, we discuss the mi- 
crosequencer and the registered ALU. The floating-point unit introduces no new 
concepts and is skipped. 


Microsequencer 


The principal function of the 8818 microsequencer is to generate the next microin- 
struction address for the microprogram. This 15-bit address is provided to the mi- 
crocode memory (Figure 16.17). 

The next address can be selected from one of five sources: 


1. The microprogram counter (MPC) register, used for repeat (reuse same ad- 
dress) and continue (increment address by 1) instructions. 


2. The stack, which supports microprogram subroutine calls as well as iterative 
loops and returns from interrupts. 

3. The DRA and DRB ports, which provide two additional paths from external 
hardware by which microprogram addresses can be generated. These two 
ports are connected to the most significant and least significant 16 bits of the 
DA bus, respectively. This allows the microsequencer to obtain the next in- 
struction address from the WCS data field of the current microinstruction or 
from a result calculated by the ALU. 


4. Register counters RCA and RCB, which can be used for additional address 
storage. 


5. An external input onto the bidirectional Y port to support external interrupts. 
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Figure 16.18 TI8818 Microsequencer 


Figure 16.18 is a logical block diagram of the 8818. The device consists of the 
following principal functional groups: 


e A 16-bit microprogram counter (MPC) consisting of a register and an 
incrementer 


e Two register counters, RCA and RCB, for counting loops and iterations, stor- 
ing branch addresses, or driving external devices 


e A 65-word by 16-bit stack, which allows microprogram subroutine calls and 
interrupts 
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e An interrupt return register and Y output enable for interrupt processing at 
the microinstruction level 


e AY output multiplexer by which the next address can be selected from MPC, 
RCA, RCB, external buses DRA and DRB, or the stack 


REGISTERS/ COUNTERS The registers RCA and RCB may be loaded from the DA 
bus, either from the current microinstruction or from the output of the ALU. The 
values may be used as counters to control the flow of execution and may be auto- 
matically decremented when accessed. The values may also be used as microinstruc- 
tion addresses to be supplied to the Y output multiplexer. Independent control of 
both registers during a single microinstruction cycle is supported with the exception 
of simultaneous decrement of both registers. 


STACK The stack allows multiple levels of nested calls or interrupts, and it can be 
used to support branching and looping. Keep in mind that these operations refer to 
the control unit, not the overall processor, and that the addresses involved are those 
of microinstructions in the control memory. 

Six stack operations are possible: 


. Clear, which sets the stack pointer to zero, emptying the stack 


. Pop, which decrements the stack pointer 


eo nN = 


. Push, which puts the contents of the MPC, interrupt return register, or DRA bus 
onto the stack and increments the stack pointer 

4. Read, which makes the address indicated by the read pointer available at the Y 

output multiplexer 


oa) 


. Hold, which causes the address of the stack pointer to remain unchanged 


6. Load stack pointer, which inputs the seven least significant bits of DRA to the 
stack pointer 


CONTROL OF MICROSEQUENCER The microsequencer is controlled primarily by 
the 12-bit field of the current microinstruction, field 28 (Table 16.7). This field 
consists of the following subfields: 


¢ OSEL (1 bit): Output select. Determines which value will be placed on the output 
of the multiplexer that feeds into the DRA bus (upper-left-hand corner of Figure 
16.18). The output is selected to come from either the stack or from register RCA. 
DRA then serves as input to either the Y output multiplexer or to register RCA. 


SELDR (1 bit): Select DR bus. If set to 1, this bit selects the external DA bus 
as input to the DRA/DRB buses. If set to 0, selects the output of the DRA 
multiplexer to the DRA bus (controlled by OSEL) and the contents of RCB 
to the DRB bus. 

ZEROIN (1 bit): Used to indicate a conditional branch. The behavior of the 
microsequencer will then depend on the condition code selected in field 1 
(Table 16.7). 

RC2-RCO0 (3 bits): Register controls. These bits determine the change in the 


contents of registers RCA and RCB. Each register can either remain the same, 
decrement, or load from the DRA/DRB buses. 
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S2-S0 (3 bits): Stack controls. These bits determine which stack operation is to 
be performed. 


MUX2-MUX0: Output controls. These bits, together with the condition code if 
used, control the Y output multiplexer and therefore the next microinstruction ad- 
dress. The multiplexer can select its output from the stack, DRA, DRB, or MPC. 


These bits can be set individually by the programmer. However, this is typically 


not done. Rather, the programmer uses mnemonics that equate to the bit patterns 
that would normally be required. Table 16.8 lists the 15 mnemonics for field 28. A 
microcode assembler converts these into the appropriate bit patterns. 


As an example, the instruction INC88181 is used to cause the next microin- 


struction in sequence to be selected, if the currently selected condition code is 1. 
From Table 16.8, we have 


INC88181 = 000000111110 


which decodes directly into 


OSEL = 0: Selects RCA as output from DRA output MUX; in this case the 
selection is irrelevant. 


SELDR = 0: As defined previously; again, this is irrelevant for this instruction. 


ZEROIN = 0: Combined with the value for MUX, indicates no branch 
should be taken. 


R = 000: Retain current value of RA and RC. 
S = 111: Retain current state of stack. 


MUX = 110: Choose MPC when condition code = 1, DRA when condition 
code = 0. 


Table 16.8 TI 8818 Microsequencer Microinstruction Bits (Field 28) 


Mnemonic 


Value 


Description 


RST8818 
BRA88181 
BRA88180 
INC88181 
INC88180 
CAL88181 
CAL88180 
RET8818 
PUSH8818 
POP8818 
LOADDRA 
LOADDRB 
LOADDRAB 
DECRDRA 
DECRDRB 


000000000110 
011000111000 
010000111110 
000000111110 
001000001000 
010000110000 
010000101110 
000000011010 
000000110111 
100000010000 
000010111110 
000110111110 
000110111100 
010001111100 
010101111100 


Reset Instruction 

Branch to DRA Instruction 

Branch to DRA Instruction 

Continue Instruction 

Continue Instruction 

Jump to Subroutine at Address Specified by DRA 

Jump to Subroutine at Address Specified by DRA 

Return from Subroutine 

Push Interrupt Return Address onto Stack 

Return from Interrupt 

Load DRA Counter from DA Bus 

Load DRB Counter from DA Bus 

Load DRA/DRB 

Decrement DRA Counter and Branch If Not Zero 
Decrement DRB Counter and Branch If Not Zero 
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Registered ALU 


The 8832 is a 32-bit ALU with 64 registers that can be configured to operate as four 
8-bit ALUs, two 16-bit ALUs, or a single 32-bit ALU. 

The 8832 is controlled by the 39 bits that make up fields 17 through 27 of the 
microinstruction (Table 16.7); these are supplied to the ALU as control signals. In 
addition, as indicated in Figure 16.17, the 8832 has external connections to the 32-bit 
DA bus and the 32-bit system Y bus. Inputs from the DA can be provided simulta- 
neously as input data to the 64-word register file and to the ALU logic module. 
Input from the system Y bus is provided to the ALU logic module. Results of the 
ALU and shift operations are output to the DA bus or the system Y bus. Results can 
also be fed back to the internal register file. 

Three 6-bit address ports allow a two-operand fetch and an operand write to 
be performed within the register file simultaneously. An MQ shifter and MQ regis- 
ter can also be configured to function independently to implement double-precision 
8-bit, 16-bit, and 32-bit shift operations. 

Fields 17 through 26 of each microinstruction control the way in which data 
flows within the 8832 and between the 8832 and the external environment. The 
fields are as follows: 


17. Write Enable. These two bits specify write 32 bits, or 16 most significant bits, 
or 16 least significant bits, or do not write into register file. The destination reg- 
ister is defined by field 24. 


18. Select Register File Data Source. If a write is to occur to the register file, these 
two bits specify the source: DA bus, DB bus, ALU output, or system Y bus. 


19. Shift Instruction Modifier. Specifies options concerning supplying end fill bits 
and reading bits that are shifted during shift instructions. 


20. Carry In. This bit indicates whether a bit is carried into the ALU for this 
operation. 


21. ALU Configuration Mode. The 8832 can be configured to operate as a single 
32-bit ALU, two 16-bit ALUs, or four 8-bit ALUs. 


22. S Input. The ALU logic module inputs are provided by two internal multi- 
plexers referred to as the S and R multiplexers. This field selects the input to 
be provided by the S multiplexer: register file, DB bus, or MQ register. The 
source register is defined by field 25. 

23. R Input. Selects input to be provided by the R multiplexer: register file or 
DA bus. 

24. Destination Register. Address of register in register file to be used for the des- 
tination operand. 

25. Source Register. Address of register in register file to be used for the source 
operand, provided by the S multiplexer. 


26. Source Register. Address of register in register file to be used for the source 
operand, provided by the R multiplexer. 


Finally, field 27 is an 8-bit opcode that specifies the arithmetic or logical func- 
tion to be performed by the ALU. Table 16.9 lists the different operations that can 
be performed. 


TI 8832 Registered ALU Instruction Field (Field 27) 


Function 


R+S+Cn 


(NOT R) +S + Cn 


R = (NOTS) + Cn 


S+Cn 


(NOT S) + Cn 


R+Cn 


(NOT R) + Cn 


RXORS 


RAND S 


RORS 


RNANDS 


RNORS 


(NOT R) AND S 


Function 


SRA 


Arithmetic right single precision shift 


SRAD 


Arithmetic right double precision shift 


SRL 


Logical right single precision shift 


SRLD 


Logical right double precision shift 


SLA 


Arithmetic left single precision shift 


SLAD 


Arithmetic left double precision shift 


SLC 


Circular left single precision shift 


SLCD 


Circular left double precision shift 


SRC 


Circular right single precision shift 


SRCD 


Circular right double precision shift 


MQSRA 


Arithmetic right shift MQ register 


MQSRL 


Logical right shift MO register 


MOQOSLL 


Logical left shift MQ register 


MOQSLC 


Circular left shift MQ register 


LOADMOQ 


Load MQ register 


PASS 


Pass ALU to Y (no shift operation) 


Function 


Set bit 1 


Set bit 0 


Test bit 1 


Test bit 0 


Absolute value 


Sign magnitude/twos-complement 


Add immediate 


Subtract immediate 


Table 16.9 


Continued 


Byte add RtoS 


Byte subtract S from R 


BINCS 


Byte subtract R from S 


Byte increment S 


BINCNS 


Byte increment negative S 


BXOR 


Byte XOR R and S 


BAND 


Byte AND R and S 


BOR 


Byte OR RandS 


Function 


Cyclic redundancy character accum. 


Select S or R 


Single length normalize 


DIVRF 


Double length normalize 


Divide remainder fix 


SDIVOF 


Signed divide quotient fix 


SMULI 


Signed multiply iterate 


SMULT 


Signed multiply terminate 


SDIVIN 


Signed divide initialize 


SDIVIS 
SDIVI 


Signed divide start 


Signed divide iterate 


UDIVIS 


Unsigned divide start 


UDIVI 


Unsigned divide iterate 


UMULI 


Unsigned multiply iterate 


SDIVIT 


Signed divide terminate 


UDIVIT 


LOADFF 


Unsigned divide terminate 


Function 


Load divide/BCD flip-flops 


CLR 


Clear 


DUMPFF 
BCDBIN 


Output divide/BCD flip-flops 
BCD to binary 


EX3BC 


Excess —3 byte correction 


EX3C 


Excess —3 word correction 


SDIVO 


Signed divide overflow test 


BINEX3 


Binary to excess —3 


NOP32 


No operation 


As an example of the coding used to specify fields 17 through 27, consider the 
instruction to add the contents of register 1 to register 2 and place the result in reg- 


ister 3. The symbolic instruction is 


CONT11 [17], WELH, SELRYFYMX, [24], R3, R2, R1, PASS + ADD 


624 CHAPTER 16 / MICROPROGRAMMED CONTROL 


The assembler will translate this into the appropriate bit pattern. The individual 
components of the instruction can be described as follows: 
e CONT11 is the basic NOP instruction. 


e Field [17] is changed to WELH (write enable, low and high), so that a 32-bit 
register is written into 


e Field [18] is changed to SELRFYMxX to select the feedback from the ALU Y 
MUX output. 


e Field [24] is changed to designate register R3 for the destination register. 
e Field [ 
e Field [26] is changed to designate register R1 for one of the source registers. 


e Field [27] is changed to specify an ALU operation of ADD. The ALU shifter 
instruction is PASS; therefore, the ALU output is not shifted by the shifter. 


25] is changed to designate register R2 for one of the source registers. 


Several points can be made about the symbolic notation. It is not necessary to 
specify the field number for consecutive fields. That is, 


CONTI11 [17], WELH, [18], SELRFYMX 
can be written as 
CONT11 [17], WELH, SELRFYMX 


because SELRFYMxX is in field 18. 
ALU instructions from Group 1 of Table 16.9 must always be used in conjunction 
with Group 2. ALU instructions from Groups 3-5 must not be used with Group 2. 


16.5 RECOMMENDED READING 


There are a number of books devoted to microprogramming. Perhaps the most comprehen- 
sive is [LYNC93]. [SEGE91] presents the fundamentals of microcoding and the design of 
microcoded systems by means of a step-by-step design of a simple 16-bit processor. 
[CART96] also presents the basic concepts using a sample machine. [PARK89] and [TI90] 
provide a detailed description of the TI 8800 Software Development Board. 

[VASS03] discuss the evolution of microcode use in computer design and its current status. 


CART96 Carter, J. Microprocesser Architecture and Microprogramming. Upper Saddle 
River, NJ: Prentice Hall, 1996. 

LYNC93_ Lynch, M. Microprogrammed State Machine Design. Boca Raton, FL: CRC 
Press, 1993. 


PARK89 Parker, A., and Hamblen, J. An Introduction to Microprogramming with Exer- 
cises Designed for the Texas Instruments SN74ACT8800 Software Development 


Board. Dallas, TX: Texas Instruments, 1989. 


SEGE91_ Segee, B., and Field, J. Microprogramming and Computer Architecture. New 
York: Wiley, 1991. 


TI90_ Texas Instruments Inc. SV74ACT880 Family Data Manual. SCSS006C, 1990. 


VASS03__- Vassiliadis, S.; Wong, S.; and Cotofana, S. “Microcode Processing: Positioning 
and Directions.” JEEE Micro, July-August 2003. 
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16.6 KEY TERMS,REVIEW QUESTIONS,AND PROBLEMS 


Key Terms 


control memory microinstruction encoding microprogrammed control unit 
control word microinstruction execution microprogramming language 
firmware microinstruction sequencing soft microprogramming 


hard microprogramming microinstructions unpacked microinstruction 
horizontal microinstruction microprogram vertical microinstruction 


Review Questions 


16.1 What is the difference between a hardwired implementation and a microprogrammed 
implementation of a control unit? 

16.2. Howisa horizontal microinstruction interpreted? 
16.3 What is the purpose of a control memory? 
16.4 What is a typical sequence in the execution of a horizontal microinstruction? 
16.5 What is the difference between horizontal and vertical microinstructions? 
16.6 What are the basic tasks performed by a microprogrammed control unit? 
16.7 What is the difference between packed and unpacked microinstructions? 
16.8 What is the difference between hard and soft microprogramming? 
16.9 What is the difference between functional and resource encoding? 

16.10 List some common applications of microprogramming. 


Problems 


16.1 Describe the implementation of the multiply instruction in the hypothetical machine 
designed by Wilkes. Use narrative and a flowchart. 


16.2 Assume a microinstruction set that includes a microinstruction with the following 
symbolic form: 


IF (AC, = 1) THEN CAR < (Cy_,) ELSE CAR< (CAR) + 1 


where AC, is the sign bit of the accumulator and Cy_¢ are the first seven bits of the 
microinstruction. Using this microinstruction, write a microprogram that implements 
a Branch Register Minus (BRM) machine instruction, which branches if the AC is 
negative. Assume that bits C, through C,, of the microinstruction specify a parallel set 
of micro-operations. Express the program symbolically. 


16.3. A simple processor has four major phases to its instruction cycle: fetch, indirect, execute, 
and interrupt. Two 1-bit flags designate the current phase in a hardwired implementation. 
a. Why are these flags needed? 
b. Why are they not needed in a microprogrammed control unit? 


16.4 Consider the control unit of Figure 16.7. Assume that the control memory is 24 bits 
wide. The control portion of the microinstruction format is divided into two fields. A 
micro-operation field of 13 bits specifies the micro-operations to be performed. An 
address selection field specifies a condition, based on the flags, that will cause a mi- 
croinstruction branch. There are eight flags. 

a. How many bits are in the address selection field? 
b. How many bits are in the address field? 
c. What is the size of the control memory? 
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16.5 


16.6 


16.7 


16.8 


How can unconditional branching be done under the circumstances of the previous 
problem? How can branching be avoided; that is, describe a microinstruction that 
does not specify any branch, conditional or unconditional. 

We wish to provide 8 control words for each machine instruction routine. Machine in- 
struction opcodes have 5 bits, and control memory has 1024 words. Suggest a mapping 
from the instruction register to the control address register. 

An encoded microinstruction format is to be used. Show how a 9-bit micro-operation 
field can be divided into subfields to specify 46 different actions. 

A processor has 16 registers, an ALU with 16 logic and 16 arithmetic functions, and a 
shifter with 8 operations, all connected by an internal processor bus. Design a mi- 
croinstruction format to specify the various micro-operations for the processor. 


Way eI Se. OS Laas 
PART FIVE 


Parallel Organization 


P.1 ISSUES FOR PART FIVE 


The final part of the book looks at the increasingly important area of parallel orga- 
nization. In a parallel organization, multiple processing units cooperate to execute 
applications. Whereas a superscalar processor exploits opportunities for parallel ex- 
ecution at the instruction level, a parallel processing organization looks for a grosser 
level of parallelism, one that enables work to be done in parallel, and cooperatively, 
by multiple processors. A number of issues are raised by such organizations. For ex- 
ample, if multiple processors, each with its own cache, share access to the same 
memory, hardware or software mechanisms must be employed to ensure that both 
processors share a valid image of main memory; this is known as the cache coher- 
ence problem. This design issue, and others, is explored in Part Five. 


ROAD MAP FOR PART FIVE 


Chapter 17 Parallel Processing 


Chapter 17 provides an overview of parallel processing considerations. Then 
the chapter looks at three approaches to organizing multiple processors: sym- 
metric multiprocessors (SMPs), clusters, and nonuniform memory access 
(NUMA) machines. SMPs and clusters are the two most common ways of 
organizing multiple processors to improve performance and availability. 
NUMA systems are a more recent concept that have not yet achieved wide- 
spread commercial success but that show considerable promise. Finally, 
Chapter 17 looks at the specialized organization known as a vector processor. 


Chapter 18 Multicore Computers 


A multicore computer is a computer chip that contains more than one 
processor (core). Multicore chips allow for greater increases in computing 
power in contrast to a single power continually made to run faster. Chapter 
18 looks at some of the fundamental design issues for multicore computers 
and provides examples from the Intel x86 and ARM architectures. 
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KEY POINTS 


® A traditional way to increase system performance is to use multiple proces- 
sors that can execute in parallel to support a given workload. The two most 
common multiple-processor organizations are symmetric multiprocessors 
(SMPs) and clusters. More recently, nonuniform memory access (NUMA) 
systems have been introduced commercially. 


® AnSMP consists of multiple similar processors within the same computer, 
interconnected by a bus or some sort of switching arrangement. The most 
critical problem to address in an SMP is that of cache coherence. Each 
processor has its own cache and so it is possible for a given line of data to be 
present in more than one cache. If such a line is altered in one cache, then 
both main memory and the other cache have an invalid version of that line. 
Cache coherence protocols are designed to cope with this problem. 


® When more than one processor are implemented on a single chip, the con- 
figuration is referred to as chip multiprocessing. A related design scheme is 
to replicate some of the components of a single processor so that the 
processor can execute multiple threads concurrently; this is known as a 
multithreaded processor. 


® A cluster is a group of interconnected, whole computers working together 
as a unified computing resource that can create the illusion of being one 
machine. The term whole computer means a system that can run on its own, 
apart from the cluster. 


® A NUMA system is a shared-memory multiprocessor in which the access 
time from a given processor to a word in memory varies with the location of 
the memory word. 


® A special-purpose type of parallel organization is the vector facility, which 
is tailored to the processing of vectors or arrays of data. 


Traditionally, the computer has been viewed as a sequential machine. Most computer 
programming languages require the programmer to specify algorithms as sequences 
of instructions. Processors execute programs by executing machine instructions in a 
sequence and one at a time. Each instruction is executed in a sequence of operations 
(fetch instruction, fetch operands, perform operation, store results). 

This view of the computer has never been entirely true. At the micro-operation 
level, multiple control signals are generated at the same time. Instruction pipelining, at 
least to the extent of overlapping fetch and execute operations, has been around for a long 
time. Both of these are examples of performing functions in parallel. This approach is 
taken further with superscalar organization, which exploits instruction-level parallelism. 
With a superscalar machine, there are multiple execution units within a single processor, 
and these may execute multiple instructions from the same program in parallel. 

As computer technology has evolved, and as the cost of computer hardware has 
dropped, computer designers have sought more and more opportunities for parallelism, 
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usually to enhance performance and, in some cases, to increase availability. After an 
overview, this chapter looks at some of the most prominent approaches to parallel or- 
ganization. First, we examine symmetric multiprocessors (SMPs), one of the earliest 
and still the most common example of parallel organization. In an SMP organization, 
multiple processors share a common memory. This organization raises the issue of 
cache coherence, to which a separate section is devoted. Then we describe clusters, 
which consist of multiple independent computers organized in a cooperative fashion. 
Next, the chapter examines multithreaded processors and chip multiprocessors. Clus- 
ters have become increasingly common to support workloads that are beyond the 
capacity of a single SMP. Another approach to the use of multiple processors that we 
examine is that of nonuniform memory access (NUMA) machines. The NUMA 
approach is relatively new and not yet proven in the marketplace, but is often consid- 
ered as an alternative to the SMP or cluster approach. Finally, this chapter looks at 
hardware organizational approaches to vector computation. These approaches opti- 
mize the ALU for processing vectors or arrays of floating-point numbers. They are 
common on the class of systems known as supercomputers. 


17.1 MULTIPLE PROCESSOR ORGANIZATIONS 


Types of Parallel Processor Systems 


A taxonomy first introduced by Flynn [FLYN72] is still the most common way of 
categorizing systems with parallel processing capability. Flynn proposed the follow- 
ing categories of computer systems: 


e Single instruction, single data (SISD) stream: A single processor executes a 
single instruction stream to operate on data stored in a single memory. 
Uniprocessors fall into this category. 


e Single instruction, multiple data (SIMD) stream: A single machine instruction 
controls the simultaneous execution of a number of processing elements on a 
lockstep basis. Each processing element has an associated data memory, so 
that each instruction is executed on a different set of data by the different 
processors. Vector and array processors fall into this category, and are dis- 
cussed in Section 18.7. 


e Multiple instruction, single data (MISD) stream: A sequence of data is trans- 
mitted to a set of processors, each of which executes a different instruction se- 
quence. This structure is not commercially implemented. 


e Multiple instruction, multiple data (MIMD) stream: A set of processors simul- 
taneously execute different instruction sequences on different data sets. SMPs, 
clusters, and NUMA systems fit into this category. 


With the MIMD organization, the processors are general purpose; each is able 
to process all of the instructions necessary to perform the appropriate data transfor- 
mation. MIMDs can be further subdivided by the means in which the processors 
communicate (Figure 17.1). If the processors share a common memory, then each 
processor accesses programs and data stored in the shared memory, and processors 
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Processor organizations 


Single instruction, Single instruction, Multiple instruction, Multiple instruction, 
single data stream multiple data stream single data stream multiple data stream 
(SISD) (SIMD) (MISD) (MIMD) 
Uniprocessor 
Vector Array Shared memory Distributed memory 
processor — processor (tightly coupled) (loosely coupled) 
Clusters 
Symmetric Nonuniform 
multiprocessor memory 
(SMP) access 
(NUMA) 


Figure 17.1. A Taxonomy of Parallel Processor Architectures 


communicate with each other via that memory. The most common form of such system 
is known as a symmetric multiprocessor (SMP), which we examine in Section 17.2. In 
an SMP, multiple processors share a single memory or pool of memory by means of a 
shared bus or other interconnection mechanism; a distinguishing feature is that the 
memory access time to any region of memory is approximately the same for each 
processor. A more recent development is the nonuniform memory access (NUMA) or- 
ganization, which is described in Section 17.5. As the name suggests, the memory access 
time to different regions of memory may differ for a NUMA processor. 

A collection of independent uniprocessors or SMPs may be interconnected to 
form a cluster. Communication among the computers is either via fixed paths or via 
some network facility. 


Parallel Organizations 


Figure 17.2 illustrates the general organization of the taxonomy of Figure 17.1. 
Figure 17.2a shows the structure of an SISD. There is some sort of control unit (CU) 
that provides an instruction stream (IS) to a processing unit (PU). The processing 
unit operates on a single data stream (DS) from a memory unit (MU). With an 
SIMD, there is still a single control unit, now feeding a single instruction stream to 
multiple PUs. Each PU may have its own dedicated memory (illustrated in Figure 
17.2b), or there may be a shared memory. Finally, with the MIMD, there are multiple 
control units, each feeding a separate instruction stream to its own PU. The MIMD 
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Figure 17.2 Alternative Computer Organizations 


may be a shared-memory multiprocessor (Figure 17.2c) or a distributed-memory 
multicomputer (Figure 17.2d). 

The design issues relating to SMPs, clusters, and NUMAs are complex, involv- 
ing issues relating to physical organization, interconnection structures, interproces- 
sor communication, operating system design, and application software techniques. 
Our concern here is primarily with organization, although we touch briefly on oper- 
ating system design issues. 


17.2 SYMMETRIC MULTIPROCESSORS 


Until fairly recently, virtually all single-user personal computers and most worksta- 
tions contained a single general-purpose microprocessor. As demands for perfor- 
mance increase and as the cost of microprocessors continues to drop, vendors have 
introduced systems with an SMP organization. The term SMP refers to a computer 
hardware architecture and also to the operating system behavior that reflects that 
architecture. An SMP can be defined as a standalone computer system with the fol- 
lowing characteristics: 


1. There are two or more similar processors of comparable capability. 


2. These processors share the same main memory and I/O facilities and are in- 
terconnected by a bus or other internal connection scheme, such that memory 
access time is approximately the same for each processor. 
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3. All processors share access to I/O devices, either through the same channels or 
through different channels that provide paths to the same device. 


4. All processors can perform the same functions (hence the term symmetric). 


5. The system is controlled by an integrated operating system that provides in- 
teraction between processors and their programs at the job, task, file, and data 
element levels. 


Points 1 to 4 should be self-explanatory. Point 5 illustrates one of the contrasts 
with a loosely coupled multiprocessing system, such as a cluster. In the latter, the 
physical unit of interaction is usually a message or complete file. In an SMP, individ- 
ual data elements can constitute the level of interaction, and there can be a high de- 
gree of cooperation between processes. 

The operating system of an SMP schedules processes or threads across all of 
the processors. An SMP organization has a number of potential advantages over a 
uniprocessor organization, including the following: 


e Performance: If the work to be done by a computer can be organized so that 
some portions of the work can be done in parallel, then a system with multiple 
processors will yield greater performance than one with a single processor of 
the same type (Figure 17.3). 
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e Availability: In a symmetric multiprocessor, because all processors can per- 
form the same functions, the failure of a single processor does not halt the ma- 
chine. Instead, the system can continue to function at reduced performance. 


e Incremental growth: A user can enhance the performance of a system by 
adding an additional processor. 


e Scaling: Vendors can offer a range of products with different price and per- 
formance characteristics based on the number of processors configured in 
the system. 


It is important to note that these are potential, rather than guaranteed, benefits. The 
operating system must provide tools and functions to exploit the parallelism in an 
SMP system. 

An attractive feature of an SMP is that the existence of multiple processors is 
transparent to the user. The operating system takes care of scheduling of threads or 
processes on individual processors and of synchronization among processors. 


Organization 


Figure 17.4 depicts in general terms the organization of a multiprocessor system. 
There are two or more processors. Each processor is self-contained, including a con- 
trol unit, ALU, registers, and, typically, one or more levels of cache. Each processor 
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Figure 17.4 Generic Block Diagram of a Tightly Coupled Multiprocessor 
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Figure 17.5 Symmetric Multiprocessor Organization 


has access to a shared main memory and the I/O devices through some form of in- 
terconnection mechanism. The processors can communicate with each other 
through memory (messages and status information left in common data areas). It 
may also be possible for processors to exchange signals directly. The memory is 
often organized so that multiple simultaneous accesses to separate blocks of mem- 
ory are possible. In some configurations, each processor may also have its own pri- 
vate main memory and I/O channels in addition to the shared resources. 

The most common organization for personal computers, workstations, and 
servers is the time-shared bus. The time-shared bus is the simplest mechanism for 
constructing a multiprocessor system (Figure 17.5). The structure and interfaces are 
basically the same as for a single-processor system that uses a bus interconnection. 
The bus consists of control, address, and data lines. To facilitate DMA transfers from 
I/O processors, the following features are provided: 


e Addressing: It must be possible to distinguish modules on the bus to deter- 
mine the source and destination of data. 


e Arbitration: Any I/O module can temporarily function as “master.” A mecha- 
nism is provided to arbitrate competing requests for bus control, using some 
sort of priority scheme. 
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e Time-sharing: When one module is controlling the bus, other modules are locked 
out and must, if necessary, suspend operation until bus access is achieved. 


These uniprocessor features are directly usable in an SMP organization. In this 
latter case, there are now multiple processors as well as multiple I/O processors all 
attempting to gain access to one or more memory modules via the bus. 

The bus organization has several attractive features: 


e Simplicity: This is the simplest approach to multiprocessor organization. The 
physical interface and the addressing, arbitration, and time-sharing logic of 
each processor remain the same as in a single-processor system. 


e Flexibility: It is generally easy to expand the system by attaching more proces- 
sors to the bus. 


¢ Reliability: The bus is essentially a passive medium, and the failure of any 
attached device should not cause failure of the whole system. 


The main drawback to the bus organization is performance. All memory refer- 
ences pass through the common bus. Thus, the bus cycle time limits the speed of the 
system. To improve performance, it is desirable to equip each processor with a cache 
memory. This should reduce the number of bus accesses dramatically. Typically, 
workstation and PC SMPs have two levels of cache, with the L1 cache internal 
(same chip as the processor) and the L2 cache either internal or external. Some 
processors now employ a L3 cache as well. 

The use of caches introduces some new design considerations. Because each 
local cache contains an image of a portion of memory, if a word is altered in one 
cache, it could conceivably invalidate a word in another cache. To prevent this, the 
other processors must be alerted that an update has taken place. This problem is 
known as the cache coherence problem and is typically addressed in hardware rather 
than by the operating system. We address this issue in Section 17.4. 


Multiprocessor Operating System Design Considerations 


An SMP operating system manages processor and other computer resources so 
that the user perceives a single operating system controlling system resources. In 
fact, such a configuration should appear as a single-processor multiprogramming 
system. In both the SMP and uniprocessor cases, multiple jobs or processes may be 
active at one time, and it is the responsibility of the operating system to schedule 
their execution and to allocate resources. A user may construct applications that 
use multiple processes or multiple threads within processes without regard to 
whether a single processor or multiple processors will be available. Thus a multi- 
processor operating system must provide all the functionality of a multiprogram- 
ming system plus additional features to accommodate multiple processors. Among 
the key design issues: 


e Simultaneous concurrent processes: OS routines need to be reentrant to allow 
several processors to execute the same IS code simultaneously. With multiple 
processors executing the same or different parts of the OS, OS tables and 
management structures must be managed properly to avoid deadlock or in- 
valid operations. 
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Scheduling: Any processor may perform scheduling, so conflicts must be 
avoided. The scheduler must assign ready processes to available processors. 


Synchronization: With multiple active processes having potential access to 
shared address spaces or shared I/O resources, care must be taken to provide 
effective synchronization. Synchronization is a facility that enforces mutual 
exclusion and event ordering. 


Memory management: Memory management on a multiprocessor must deal 
with all of the issues found on uniprocessor machines, as is discussed in 
Chapter 8. In addition, the operating system needs to exploit the available 
hardware parallelism, such as multiported memories, to achieve the best per- 
formance. The paging mechanisms on different processors must be coordi- 
nated to enforce consistency when several processors share a page or segment 
and to decide on page replacement. 


Reliability and fault tolerance: The operating system should provide graceful 
degradation in the face of processor failure. The scheduler and other portions 
of the operating system must recognize the loss of a processor and restructure 
management tables accordingly. 


A Mainframe SMP 


Most PC and workstation SMPs use a bus interconnection strategy as depicted in 
Figure 17.5. It is instructive to look at an alternative approach, which is used for a re- 
cent implementation of the IBM zSeries mainframe family [STEG04, MAK04], 
called the z990. This family of systems spans a range from a uniprocessor with one 
main memory card to a high-end system with 48 processors and 8 memory cards. 
The key components of the configuration are shown in Figure 17.6: 


Dual-core processor chip: Each processor chip includes two identical central 
processors (CPs). The CP is a CISC superscalar microprocessor, in which most of 
the instructions are hardwired and the rest are executed by vertical microcode. 
Each CP includes a 256-KB L1 instruction cache and a 256-KB L1 data cache. 


L2 cache: Each L2 cache contains 32 MB. The L2 caches are arranged in clus- 
ters of five, with each cluster supporting eight processor chips and providing 
access to the entire main memory space. 

System control element (SCE): The SCE arbitrates system communication, 
and has a central role in maintaining cache coherence. 

Main store control (MSC): The MSCs interconnect the L2 caches and the 
main memory. 


Memory card: Each card holds 32 GB of memory. The maximum configurable 
memory consists of 8 memory cards for a total of 256 GB. Memory cards in- 
terconnect to the MSC via synchronous memory interfaces (SMIs). 

Memory bus adapter (MBA): The MBA provides an interface to various types 
of I/O channels. Traffic to/from the channels goes directly to the L2 cache. 


The microprocessor in the z990 is relatively uncommon compared with other 


modern processors because, although it is superscalar, it executes instructions in 
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Figure 17.6 IBM z990 Multiprocessor Structure 


strict architectural order. However, it makes up for this by having a shorter pipeline 
and much larger caches and TLBs compared with other processors, along with other 
performance-enhancing features. 

The z990 system comprises one to four books. Each book is a pluggable unit 
containing up to 12 processors with up to 64 GB of memory, I/O adapters, and a sys- 
tem control element (SCE) that connects these other elements. The SCE within 
each book contains a 32-MB L2 cache, which serves as the central coherency point 
for that particular book. Both the L2 cache and the main memory are accessible by 
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a processor or I/O adapter within that book or any of the other three books in the 
system. The SCE and L2 cache chips also connect with corresponding elements on 
the other books in a ring configuration. 

There are a several interesting features in the z2990 SMP configuration, which 
we discuss in turn: 


e Switched interconnection 
e Shared L2 caches 


SWITCHED INTERCONNECTION A single shared bus is a common arrangement 
on SMPs for PCs and workstations (Figure 17.5). With this arrangement, the single 
bus becomes a bottleneck affecting the scalability (ability to scale to larger sizes) 
of the design. The z990 copes with this problem in two ways. First, main memory is 
split into multiple cards, each with its own storage controller that can handle 
memory accesses at high speeds. The average traffic load to main memory is cut, 
because of the independent paths to separate parts of memory. Each book in- 
cludes two memory cards, for a total of eight cards across a maximum configura- 
tion. Second, the connection from processors (actually from L2 caches) to a single 
memory card is not in the form of a shared bus but rather point-to-point links. 
Each processor chip has a link to each of the L2 caches on the same book, and 
each L2 cache has a link, via the MSC, to each of the two memory cards on the 
same book. 

Each L2 cache only connects to the two memory cards on the same book. The 
system controller provides links (not shown) to the other books in the configura- 
tion, so that all of main memory is accessible by all of the processors. 

Point-to-point links rather than a bus also provides connections to I/O chan- 
nels. Each L2 cache on a book connects to each of the MBAs for that book. The 
MBAs, in turn, connect to the I/O channels. 


SHARED L2 CACHES In a typical two-level cache scheme for an SMP, each proces- 
sor has a dedicated L1 cache and a dedicated L2 cache. In recent years, interest in 
the concept of a shared L2 cache has been growing. In an earlier version of its main- 
frame SMP, known as generation 3 (G3), IBM made use of dedicated L2 caches. In 
its later versions (G4, G5, and z900 series), a shared L2 cache is used. Two consider- 
ations dictated this change: 


1. In moving from G3 to G4, IBM doubled the speed of the microprocessors. If 
the G3 organization were retained, a significant increase in bus traffic would 
occur. At the same time, it was desired to reuse as many G3 components as pos- 
sible. Without a significant bus upgrade, the BSNs would become a bottleneck. 


nN 


. Analysis of typical mainframe workloads revealed a large degree of sharing of 
instructions and data among processors. 


These considerations led the G4 design team to consider the use of one or 
more L2 caches, each of which was shared by multiple processors (each processor 
having a dedicated on-chip L1 cache). At first glance, sharing an L2 cache might 
seem a bad idea. Access to memory from processors should be slower because the 
processors must now contend for access to a single L2 cache. However, if a sufficient 
amount of data is in fact shared by multiple processors, then a shared cache can 
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increase throughput rather than retard it. Data that are shared and found in the 
shared cache are obtained more quickly than if they must be obtained over the bus. 


17.3 CACHE COHERENCE AND THE MESI PROTOCOL 


In contemporary multiprocessor systems, it is customary to have one or two levels of 
cache associated with each processor. This organization is essential to achieve rea- 
sonable performance. It does, however, create a problem known as the cache coher- 
ence problem. The essence of the problem is this: Multiple copies of the same data 
can exist in different caches simultaneously, and if processors are allowed to update 
their own copies freely, an inconsistent view of memory can result. In Chapter 4 we 
defined two common write policies: 


e Write back: Write operations are usually made only to the cache. Main mem- 
ory is only updated when the corresponding cache line is flushed from the 
cache. 


e Write through: All write operations are made to main memory as well as to 
the cache, ensuring that main memory is always valid. 


It is clear that a write-back policy can result in inconsistency. If two caches 
contain the same line, and the line is updated in one cache, the other cache will 
unknowingly have an invalid value. Subsequent reads to that invalid line produce 
invalid results. Even with the write-through policy, inconsistency can occur unless 
other caches monitor the memory traffic or receive some direct notification of 
the update. 

In this section, we will briefly survey various approaches to the cache coher- 
ence problem and then focus on the approach that is most widely used: the MESI 
(modified/exclusive/shared/invalid) protocol. A version of this protocol is used on 
both the Pentium 4 and PowerPC implementations. 

For any cache coherence protocol, the objective is to let recently used local 
variables get into the appropriate cache and stay there through numerous reads and 
write, while using the protocol to maintain consistency of shared variables that 
might be in multiple caches at the same time. Cache coherence approaches have 
generally been divided into software and hardware approaches. Some implementa- 
tions adopt a strategy that involves both software and hardware elements. Never- 
theless, the classification into software and hardware approaches is still instructive 
and is commonly used in surveying cache coherence strategies. 


Software Solutions 


Software cache coherence schemes attempt to avoid the need for additional hard- 
ware circuitry and logic by relying on the compiler and operating system to deal 
with the problem. Software approaches are attractive because the overhead of de- 
tecting potential problems is transferred from run time to compile time, and the de- 
sign complexity is transferred from hardware to software. On the other hand, 
compile-time software approaches generally must make conservative decisions, 
leading to inefficient cache utilization. 
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Compiler-based coherence mechanisms perform an analysis on the code to de- 
termine which data items may become unsafe for caching, and they mark those 
items accordingly. The operating system or hardware then prevents noncacheable 
items from being cached. 

The simplest approach is to prevent any shared data variables from being 
cached. This is too conservative, because a shared data structure may be exclusively 
used during some periods and may be effectively read-only during other periods. It 
is only during periods when at least one process may update the variable and at least 
one other process may access the variable that cache coherence is an issue. 

More efficient approaches analyze the code to determine safe periods for 
shared variables. The compiler then inserts instructions into the generated code to 
enforce cache coherence during the critical periods. A number of techniques have 
been developed for performing the analysis and for enforcing the results; see 
[LILJ93] and [STEN90] for surveys. 


Hardware Solutions 


Hardware-based solutions are generally referred to as cache coherence protocols. 
These solutions provide dynamic recognition at run time of potential inconsistency 
conditions. Because the problem is only dealt with when it actually arises, there is 
more effective use of caches, leading to improved performance over a software 
approach. In addition, these approaches are transparent to the programmer and the 
compiler, reducing the software development burden. 

Hardware schemes differ in a number of particulars, including where the state 
information about data lines is held, how that information is organized, where coher- 
ence is enforced, and the enforcement mechanisms. In general, hardware schemes 
can be divided into two categories: directory protocols and snoopy protocols. 


DIRECTORY PROTOCOLS Directory protocols collect and maintain information 
about where copies of lines reside. Typically, there is a centralized controller that is 
part of the main memory controller, and a directory that is stored in main memory. 
The directory contains global state information about the contents of the various 
local caches. When an individual cache controller makes a request, the centralized 
controller checks and issues necessary commands for data transfer between mem- 
ory and caches or between caches. It is also responsible for keeping the state infor- 
mation up to date; therefore, every local action that can affect the global state of a 
line must be reported to the central controller. 

Typically, the controller maintains information about which processors have a 
copy of which lines. Before a processor can write to a local copy of a line, it must re- 
quest exclusive access to the line from the controller. Before granting this exclusive 
access, the controller sends a message to all processors with a cached copy of this 
line, forcing each processor to invalidate its copy. After receiving acknowledgments 
back from each such processor, the controller grants exclusive access to the request- 
ing processor. When another processor tries to read a line that is exclusively granted 
to another processor, it will send a miss notification to the controller. The controller 
then issues a command to the processor holding that line that requires the processor 
to do a write back to main memory. The line may now be shared for reading by the 
original processor and the requesting processor. 
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Directory schemes suffer from the drawbacks of a central bottleneck and the 
overhead of communication between the various cache controllers and the central 
controller. However, they are effective in large-scale systems that involve multiple 
buses or some other complex interconnection scheme. 


SNOOPY PROTOCOLS Snoopy protocols distribute the responsibility for maintain- 
ing cache coherence among all of the cache controllers in a multiprocessor. A cache 
must recognize when a line that it holds is shared with other caches. When an update 
action is performed on a shared cache line, it must be announced to all other caches 
by a broadcast mechanism. Each cache controller is able to “snoop” on the network 
to observe these broadcasted notifications, and react accordingly. 

Snoopy protocols are ideally suited to a bus-based multiprocessor, because the 
shared bus provides a simple means for broadcasting and snooping. However, be- 
cause one of the objectives of the use of local caches is to avoid bus accesses, care 
must be taken that the increased bus traffic required for broadcasting and snooping 
does not cancel out the gains from the use of local caches. 

Two basic approaches to the snoopy protocol have been explored: write inval- 
idate and write update (or write broadcast). With a write-invalidate protocol, there 
can be multiple readers but only one writer at a time. Initially, a line may be shared 
among several caches for reading purposes. When one of the caches wants to per- 
form a write to the line, it first issues a notice that invalidates that line in the other 
caches, making the line exclusive to the writing cache. Once the line is exclusive, the 
owning processor can make cheap local writes until some other processor requires 
the same line. 

With a write-update protocol, there can be multiple writers as well as multiple 
readers. When a processor wishes to update a shared line, the word to be updated is 
distributed to all others, and caches containing that line can update it. 

Neither of these two approaches is superior to the other under all circum- 
stances. Performance depends on the number of local caches and the pattern of 
memory reads and writes. Some systems implement adaptive protocols that employ 
both write-invalidate and write-update mechanisms. 

The write-invalidate approach is the most widely used in commercial multi- 
processor systems, such as the Pentium 4 and PowerPC. It marks the state of every 
cache line (using two extra bits in the cache tag) as modified, exclusive, shared, or in- 
valid. For this reason, the write-invalidate protocol is called MESI. In the remainder 
of this section, we will look at its use among local caches across a multiprocessor. 
For simplicity in the presentation, we do not examine the mechanisms involved in 
coordinating among both level 1 and level 2 locally as well as at the same time coor- 
dinating across the distributed multiprocessor. This would not add any new princi- 
ples but would greatly complicate the discussion. 


The MESI Protocol 


To provide cache consistency on an SMP, the data cache often supports a protocol 
known as MESI. For MESI, the data cache includes two status bits per tag, so that 
each line can be in one of four states: 


e Modified: The line in the cache has been modified (different from main memory) 
and is available only in this cache. 
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Table 17.1. MESI Cache Line States 


M E S I 
Modified Exclusive Shared Invalid 


This cache line valid? Yes Yes Yes No 


The memory copy is... out of date valid valid — 


Copies exist in other caches? No No Maybe Maybe 


A write to this line... does not go to bus | does not goto bus | goes to bus and | goes directly 
updates cache to bus 


e Exclusive: The line in the cache is the same as that in main memory and is not 
present in any other cache. 


e Shared: The line in the cache is the same as that in main memory and may be 
present in another cache. 


e Invalid: The line in the cache does not contain valid data. 


Table 17.1 summarizes the meaning of the four states. Figure 17.7 displays a 
state diagram for the MESI protocol. Keep in mind that each line of the cache has 
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Figure 17.7 MESI State Transition Diagram 
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its own state bits and therefore its own realization of the state diagram. Figure 17.7a 
shows the transitions that occur due to actions initiated by the processor attached to 
this cache. Figure 17.7b shows the transitions that occur due to events that are 
snooped on the common bus. This presentation of separate state diagrams for 
processor-initiated and bus-initiated actions helps to clarify the logic of the MESI 
protocol. At any time a cache line is in a single state. If the next event is from the 
attached processor, then the transition is dictated by Figure 17.7a and if the next 
event is from the bus, the transition is dictated by Figure 17.7b. Let us look at these 
transitions in more detail. 


READ MISS When a read miss occurs in the local cache, the processor initiates a 
memory read to read the line of main memory containing the missing address. The 
processor inserts a signal on the bus that alerts all other processor/cache units to 
snoop the transaction. There are a number of possible outcomes: 


e If one other cache has a clean (unmodified since read from memory) copy of 
the line in the exclusive state, it returns a signal indicating that it shares this 
line. The responding processor then transitions the state of its copy from 
exclusive to shared, and the initiating processor reads the line from main 
memory and transitions the line in its cache from invalid to shared. 


e If one or more caches have a clean copy of the line in the shared state, each of 
them signals that it shares the line. The initiating processor reads the line and 
transitions the line in its cache from invalid to shared. 


e If one other cache has a modified copy of the line, then that cache blocks the 
memory read and provides the line to the requesting cache over the shared 
bus. The responding cache then changes its line from modified to shared.' The 
line sent to the requesting cache is also received and processed by the memory 
controller, which stores the block in memory. 


e Ifno other cache has a copy of the line (clean or modified), then no signals are 
returned. The initiating processor reads the line and transitions the line in its 
cache from invalid to exclusive. 


READ HIT When a read hit occurs on a line currently in the local cache, the proces- 
sor simply reads the required item. There is no state change: The state remains modi- 
fied, shared, or exclusive. 


WRITE MISS When a write miss occurs in the local cache, the processor initiates a 
memory read to read the line of main memory containing the missing address. For 
this purpose, the processor issues a signal on the bus that means read-with-intent-to- 
modify (RWITM). When the line is loaded, it is immediately marked modified. With 
respect to other caches, two possible scenarios precede the loading of the line of data. 

First, some other cache may have a modified copy of this line (state = modify). 
In this case, the alerted processor signals the initiating processor that another processor 


'In some implementations, the cache with the modified line signals the initiating processor to retry. 
Meanwhile, the processor with the modified copy seizes the bus, writes the modified line back to main 
memory, and transitions the line in its cache from modified to shared. Subsequently, the requesting 
processor tries again and finds that one or more processors have a clean copy of the line in the shared 
state, as described in the preceding point. 
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has a modified copy of the line. The initiating processor surrenders the bus and 
waits. The other processor gains access to the bus, writes the modified cache line 
back to main memory, and transitions the state of the cache line to invalid (because 
the initiating processor is going to modify this line). Subsequently, the initiating 
processor will again issue a signal to the bus of RWITM and then read the line 
from main memory, modify the line in the cache, and mark the line in the modi- 
fied state. 

The second scenario is that no other cache has a modified copy of the requested 
line. In this case, no signal is returned, and the initiating processor proceeds to read in 
the line and modify it. Meanwhile, if one or more caches have a clean copy of the line 
in the shared state, each cache invalidates its copy of the line, and if one cache has a 
clean copy of the line in the exclusive state, it invalidates its copy of the line. 


WRITE HIT When a write hit occurs on a line currently in the local cache, the effect 
depends on the current state of that line in the local cache: 


e Shared: Before performing the update, the processor must gain exclusive own- 
ership of the line. The processor signals its intent on the bus. Each processor 
that has a shared copy of the line in its cache transitions the sector from shared 
to invalid. The initiating processor then performs the update and transitions its 
copy of the line from shared to modified. 


e Exclusive: The processor already has exclusive control of this line, and so it 
simply performs the update and transitions its copy of the line from exclusive 
to modified. 


e Modified: The processor already has exclusive control of this line and has the 
line marked as modified, and so it simply performs the update. 


L1-L2 CACHE CONSISTENCY We have so far described cache coherency protocols 
in terms of the cooperate activity among caches connected to the same bus or other 
SMP interconnection facility. Typically, these caches are L2 caches, and each proces- 
sor also has an L1 cache that does not connect directly to the bus and that therefore 
cannot engage in a snoopy protocol. Thus, some scheme is needed to maintain data 
integrity across both levels of cache and across all caches in the SMP configuration. 

The strategy is to extend the MESI protocol (or any cache coherence proto- 
col) to the L1 caches. Thus, each line in the L1 cache includes bits to indicate the 
state. In essence, the objective is the following: for any line that is present in both an 
L2 cache and its corresponding L1 cache, the L1 line state should track the state of 
the L2 line. A simple means of doing this is to adopt the write-through policy in the 
L1 cache; in this case the write through is to the L2 cache and not to the memory. 
The L1 write-through policy forces any modification to an L1 line out to the L2 
cache and therefore makes it visible to other L2 caches. The use of the L1 write- 
through policy requires that the L1 content must be a subset of the L2 content. This 
in turn suggests that the associativity of the L2 cache should be equal to or greater 
than that of the L1 associativity. The L1 write-through policy is used in the IBM 
S/390 SMP. 

If the L1 cache has a write-back policy, the relationship between the two caches 
is more complex. There are several approaches to maintaining coherence. For exam- 
ple, the approach used on the Pentium II is described in detail in [SHANOS]. 
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17.4 MULTITHREADING AND CHIP MULTIPROCESSORS 


The most important measure of performance for a processor is the rate at which it 
executes instructions. This can be expressed as 


MIPS rate = f X IPC 


where f is the processor clock frequency, in MHz, and /PC (instructions per cycle) 
is the average number of instructions executed per cycle. Accordingly, designers 
have pursued the goal of increased performance on two fronts: increasing clock 
frequency and increasing the number of instructions executed or, more properly, 
the number of instructions that complete during a processor cycle. As we have 
seen in earlier chapters, designers have increased IPC by using an instruction 
pipeline and then by using multiple parallel instruction pipelines in a superscalar 
architecture. With pipelined and multiple-pipeline designs, the principal problem is 
to maximize the utilization of each pipeline stage. To improve throughput, design- 
ers have created ever more complex mechanisms, such as executing some instruc- 
tions in a different order from the way they occur in the instruction stream and 
beginning execution of instructions that may never be needed. But as was dis- 
cussed in Section 2.2, this approach may be reaching a limit due to complexity and 
power consumption concerns. 

An alternative approach, which allows for a high degree of instruction-level 
parallelism without increasing circuit complexity or power consumption, is called 
multithreading. In essence, the instruction stream is divided into several smaller 
streams, known as threads, such that the threads can be executed in parallel. 

The variety of specific multithreading designs, realized in both commercial 
systems and experimental systems, is vast. In this section, we give a brief survey of 
the major concepts. 


Implicit and Explicit Multithreading 


The concept of thread used in discussing multithreaded processors may or may not 
be the same as the concept of software threads in a multiprogrammed operating sys- 
tem. It will be useful to define terms briefly: 


e Process: An instance of a program running on a computer. A process embod- 
ies two key characteristics: 


— Resource ownership: A process includes a virtual address space to hold the 
process image; the process image is the collection of program, data, stack, 
and attributes that define the process. From time to time, a process may be 
allocated control or ownership of resources, such as main memory, I/O chan- 
nels, I/O devices, and files. 


—Scheduling/execution: The execution of a process follows an execution path 
(trace) through one or more programs. This execution may be interleaved 
with that of other processes. Thus, a process has an execution state (Run- 
ning, Ready, etc.) and a dispatching priority and is the entity that is sched- 
uled and dispatched by the operating system. 
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e Process switch: An operation that switches the processor from one process to 
another, by saving all the process control data, registers, and other information 
for the first and replacing them with the process information for the second.” 


e Thread: A dispatchable unit of work within a process. It includes a processor 
context (which includes the program counter and stack pointer) and its own 
data area for a stack (to enable subroutine branching). A thread executes se- 
quentially and is interruptible so that the processor can turn to another thread. 


e Thread switch: The act of switching processor control from one thread to an- 
other within the same process. Typically, this type of switch is much less costly 
than a process switch. 


Thus, a thread is concerned with scheduling and execution, whereas a process 
is concerned with both scheduling/execution and resource ownership. The multiple 
threads within a process share the same resources. This is why a thread switch is 
much less time consuming than a process switch. Traditional operating systems, such 
as earlier versions of UNIX, did not support threads. Most modern operating sys- 
tems, such as Linux, other versions of UNIX, and Windows, do support thread. A 
distinction is made between user-level threads, which are visible to the application 
program, and kernel-level threads, which are visible only to the operating system. 
Both of these may be referred to as explicit threads, defined in software. 

All of the commercial processors and most of the experimental processors so 
far have used explicit multithreading. These systems concurrently execute instruc- 
tions from different explicit threads, either by interleaving instructions from differ- 
ent threads on shared pipelines or by parallel execution on parallel pipelines. 
Implicit multithreading refers to the concurrent execution of multiple threads 
extracted from a single sequential program. These implicit threads may be defined 
either statically by the compiler or dynamically by the hardware. In the remainder 
of this section we consider explicit multithreading. 


Approaches to Explicit Multithreading 


At minimum, a multithreaded processor must provide a separate program counter 
for each thread of execution to be executed concurrently. The designs differ in the 
amount and type of additional hardware used to support concurrent thread execu- 
tion. In general, instruction fetching takes place on a thread basis. The processor 
treats each thread separately and may use a number of techniques for optimizing 
single-thread execution, including branch prediction, register renaming, and super- 
scalar techniques. What is achieved is thread-level parallelism, which may provide 
for greatly improved performance when married to instruction-level parallelism. 
Broadly speaking, there are four principal approaches to multithreading: 


e Interleaved multithreading: This is also known as fine-grained multithreading. 
The processor deals with two or more thread contexts at a time, switching 
from one thread to another at each clock cycle. If a thread is blocked because 


The term context switch is often found in OS literature and textbooks. Unfortunately, although most of 
the literature uses this term to mean what is here called a process switch, other sources use it to mean a 
thread switch. To avoid ambiguity, the term is not used in this book. 
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of data dependencies or memory latencies, that thread is skipped and a ready 
thread is executed. 


e Blocked multithreading: This is also known as coarse-grained multithreading. 
The instructions of a thread are executed successively until an event occurs 
that may cause delay, such as a cache miss. This event induces a switch to an- 
other thread. This approach is effective on an in-order processor that would 
stall the pipeline for a delay event such as a cache miss. 


e Simultaneous multithreading (SMT): Instructions are simultaneously issued 
from multiple threads to the execution units of a superscalar processor. This 
combines the wide superscalar instruction issue capability with the use of mul- 
tiple thread contexts. 


e Chip multiprocessing: In this case, the entire processor is replicated on a single 
chip and each processor handles separate threads. The advantage of this 
approach is that the available logic area on a chip is used effectively without 
depending on ever-increasing complexity in pipeline design. This is referred to 
as multicore; we examine this topic separately in Chapter 18. 


For the first two approaches, instructions from different threads are not exe- 
cuted simultaneously. Instead, the processor is able to rapidly switch from one 
thread to another, using a different set of registers and other context information. 
This results in a better utilization of the processor’s execution resources and avoids 
a large penalty due to cache misses and other latency events. The SMT approach in- 
volves true simultaneous execution of instructions from different threads, using 
replicated execution resources. Chip multiprocessing also enables simultaneous ex- 
ecution of instructions from different threads. 

Figure 17.8, based on one in [UNGE02], illustrates some of the possible 
pipeline architectures that involve multithreading and contrasts these with approaches 
that do not use multithreading. Each horizontal row represents the potential issue 
slot or slots for a single execution cycle; that is, the width of each row corresponds 
to the maximum number of instructions that can be issued in a single clock cycle.* 
The vertical dimension represents the time sequence of clock cycles. An empty 
(shaded) slot represents an unused execution slot in one pipeline. A no-op is indi- 
cated byN. 

The first three illustrations in Figure 17.8 show different approaches with a scalar 
(i.e., single-issue) processor: 


e Single-threaded scalar: This is the simple pipeline found in traditional RISC 
and CISC machines, with no multithreading. 


e Interleaved multithreaded scalar: This is the easiest multithreading approach 
to implement. By switching from one thread to another at each clock cycle, the 
pipeline stages can be kept fully occupied, or close to fully occupied. The 
hardware must be capable of switching from one thread context to another 
between cycles. 


3Issue slots are the position from which instructions can be issued in a given clock cycle. Recall from 
Chapter 14 that instruction issue is the process of initiating instruction execution in the processor’s func- 
tional units. This occurs when an instruction moves from the decode stage of the pipeline to the first exe- 
cute stage of the pipeline. 
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Figure 17.8 Approaches to Executing Multiple Threads 


e Blocked multithreaded scalar: In this case, a single thread is executed until a 
latency event occurs that would stop the pipeline, at which time the processor 
switches to another thread. 


Figure 17.8c shows a situation in which the time to perform a thread switch is 
one cycle, whereas Figure 17.8b shows that thread switching occurs in zero cycles. In 
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the case of interleaved multithreading, it is assumed that there are no control or 
data dependencies between threads, which simplifies the pipeline design and there- 
fore should allow a thread switch with no delay. However, depending on the specific 
design and implementation, block multithreading may require a clock cycle to per- 
form a thread switch, as illustrated in Figure 17.8. This is true if a fetched instruction 
triggers the thread switch and must be discarded from the pipeline [UNGE03]. 

Although interleaved multithreading appears to offer better processor utiliza- 
tion than blocked multithreading, it does so at the sacrifice of single-thread perfor- 
mance. The multiple threads compete for cache resources, which raises the 
probability of a cache miss for a given thread. 

More opportunities for parallel execution are available if the processor can 
issue multiple instructions per cycle. Figures 17.8d through 17.8: illustrate a number 
of variations among processors that have hardware for issuing four instructions per 
cycle. In all these cases, only instructions from a single thread are issued in a single 
cycle. The following alternatives are illustrated: 


Superscalar: This is the basic superscalar approach with no multithreading. 
Until relatively recently, this was the most powerful approach to providing 
parallelism within a processor. Note that during some cycles, not all of the 
available issue slots are used. During these cycles, less than the maximum 
number of instructions is issued; this is referred to as horizontal loss. During 
other instruction cycles, no issue slots are used; these are cycles when no in- 
structions can be issued; this is referred to as vertical loss. 


Interleaved multithreading superscalar: During each cycle, as many instruc- 
tions as possible are issued from a single thread. With this technique, potential 
delays due to thread switches are eliminated, as previously discussed. How- 
ever, the number of instructions issued in any given cycle is still limited by de- 
pendencies that exist within any given thread. 


Blocked multithreaded superscalar: Again, instructions from only one thread 
may be issued during any cycle, and blocked multithreading is used. 


Very long instruction word (VLIW): A VLIW architecture, such as IA-64, 
places multiple instructions in a single word. Typically, a VLIW is constructed by 
the compiler, which places operations that may be executed in parallel in the 
same word. In a simple VLIW machine (Figure 17.8g), if it is not possible to com- 
pletely fill the word with instructions to be issued in parallel, no-ops are used. 


Interleaved multithreading VLIW: This approach should provide similar effi- 
ciencies to those provided by interleaved multithreading on a superscalar 
architecture. 


Blocked multithreaded VLIW: This approach should provide similar efficien- 
cies to those provided by blocked multithreading on a superscalar architecture. 


The final two approaches illustrated in Figure 17.8 enable the parallel, simul- 
taneous execution of multiple threads: 


e Simultaneous multithreading: Figure 17.81 shows a system capable of issuing 8 
instructions at a time. If one thread has a high degree of instruction-level 
parallelism, it may on some cycles be able fill all of the horizontal slots. On 
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other cycles, instructions from two or more threads may be issued. If sufficient 
threads are active, it should usually be possible to issue the maximum number 
of instructions on each cycle, providing a high level of efficiency. 


e Chip multiprocessor (multicore): Figure 17.8k shows a chip containing four 
processors, each of which has a two-issue superscalar processor. Each proces- 
sor is assigned a thread, from which it can issue up to two instructions per 
cycle. We discuss multicore computers in Chapter 18. 


Comparing Figures 17.8j and 17.8k, we see that a chip multiprocessor with the 
same instruction issue capability as an SMT cannot achieve the same degree of 
instruction-level parallelism. This is because the chip multiprocessor is not able to 
hide latencies by issuing instructions from other threads. On the other hand, the chip 
multiprocessor should outperform a superscalar processor with the same instruction 
issue capability, because the horizontal losses will be greater for the superscalar 
processor. In addition, it is possible to use multithreading within each of the proces- 
sors on a chip multiprocessor, and this is done on some contemporary machines. 


Example Systems 


PENTIUM 4 More recent models of the Pentium 4 use a multithreading technique 
that the Intel literature refers to as hyperthreading [MARR0O2]. In essence, the Pen- 
tium 4 approach is to use SMT with support for two threads. Thus, the single multi- 
threaded processor is logically two processors. 


IBM poweEr35 The IBM Power5S chip, which is used in high-end PowerPC products, 
combines chip multiprocessing with SMT [KALL04]. The chip has two separate 
processors, each of which is a multithreaded processor capable of supporting two 
threads concurrently using SMT. Interestingly, the designers simulated various alter- 
natives and found that having two two-way SMT processors on a single chip pro- 
vided superior performance to a single four-way SMT processor. The simulations 
showed that additional multithreading beyond the support for two threads might 
decrease performance because of cache thrashing, as data from one thread displaces 
data needed by another thread. 

Figure 17.9 shows the IBM Power5’s instruction flow diagram. Only a few of 
the elements in the processor need to be replicated, with separate elements dedi- 
cated to separate threads. Two program counters are used. The processor alternates 
fetching instructions, up to eight at a time, between the two threads. All the instruc- 
tions are stored in a common instruction cache and share an instruction translation 
facility, which does a partial instruction decode. When a conditional branch is en- 
countered, the branch prediction facility predicts the direction of the branch and, if 
possible, calculates the target address. For predicting the target of a subroutine re- 
turn, the processor uses a return stack, one for each thread. 

Instructions then move into two separate instruction buffers. Then, on the 
basis of thread priority, a group of instructions is selected and decoded in parallel. 
Next, instructions flow through a register-renaming facility in program order. Logi- 
cal registers are mapped to physical registers. The PowerS has 120 physical general- 
purpose registers and 120 physical floating-point registers. The instructions are then 
moved into issue queues. From the issue queues, instructions are issued using 
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symmetric multithreading. That is, the processor has a superscalar architecture and 
can issue instructions from one or both threads in parallel. At the end of the 
pipeline, separate thread resources are needed to commit the instructions. 


17.5 CLUSTERS 


An important and relatively recent development computer system design is cluster- 
ing. Clustering is an alternative to symmetric multiprocessing as an approach to pro- 
viding high performance and high availability and is particularly attractive for 
server applications. We can define a cluster as a group of interconnected, whole 
computers working together as a unified computing resource that can create the 
illusion of being one machine. The term whole computer means a system that can 
run on its own, apart from the cluster; in the literature, each computer in a cluster is 
typically referred to as a node. 

[BREW97] lists four benefits that can be achieved with clustering. These can 
also be thought of as objectives or design requirements: 


e Absolute scalability: It is possible to create large clusters that far surpass the 
power of even the largest standalone machines. A cluster can have tens, hun- 
dreds, or even thousands of machines, each of which is a multiprocessor. 


e Incremental scalability: A cluster is configured in such a way that it is possible 
to add new systems to the cluster in small increments. Thus, a user can start out 
with a modest system and expand it as needs grow, without having to go 
through a major upgrade in which an existing small system is replaced with a 
larger system. 


e High availability: Because each node in a cluster is a standalone computer, the 
failure of one node does not mean loss of service. In many products, fault tol- 
erance is handled automatically in software. 


e Superior price/performance: By using commodity building blocks, it is possible 
to put together a cluster with equal or greater computing power than a single 
large machine, at much lower cost. 


Cluster Configurations 


In the literature, clusters are classified in a number of different ways. Perhaps the 
simplest classification is based on whether the computers in a cluster share access to 
the same disks. Figure 17.10a shows a two-node cluster in which the only intercon- 
nection is by means of a high-speed link that can be used for message exchange to 
coordinate cluster activity. The link can be a LAN that is shared with other comput- 
ers that are not part of the cluster or the link can be a dedicated interconnection 
facility. In the latter case, one or more of the computers in the cluster will have a link 
toa LAN or WAN so that there is a connection between the server cluster and remote 
client systems. Note that in the figure, each computer is depicted as being a multi- 
processor. This is not necessary but does enhance both performance and availability. 

In the simple classification depicted in Figure 17.10, the other alternative is a 
shared-disk cluster. In this case, there generally is still a message link between 
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Figure 17.10 Cluster Configurations 


nodes. In addition, there is a disk subsystem that is directly linked to multiple com- 
puters within the cluster. In this figure, the common disk subsystem is a RAID sys- 
tem. The use of RAID or some similar redundant disk technology is common in 
clusters so that the high availability achieved by the presence of multiple computers 
is not compromised by a shared disk that is a single point of failure. 

A clearer picture of the range of cluster options can be gained by looking at 
functional alternatives. Table 17.2 provides a useful classification along functional 
lines, which we now discuss. 

A common, older method, known as passive standby, is simply to have one 
computer handle all of the processing load while the other computer remains inac- 
tive, standing by to take over in the event of a failure of the primary. To coordinate 
the machines, the active, or primary, system periodically sends a “heartbeat” mes- 
sage to the standby machine. Should these messages stop arriving, the standby 
assumes that the primary server has failed and puts itself into operation. This ap- 
proach increases availability but does not improve performance. Further, if the only 
information that is exchanged between the two systems is a heartbeat message, and 
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Clustering Method 


Description 


Clustering Methods: Benefits and Limitations 


Benefits 
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Limitations 


Passive Standby 


A secondary server takes 
over in case of primary 
server failure. 


Easy to implement. 


High cost because the 
secondary server is 
unavailable for other 
processing tasks. 


Active Secondary: 


The secondary server is 
also used for processing 
tasks. 


Reduced cost because 
secondary servers can 
be used for processing. 


Increased complexity. 


Separate Servers 


Servers Connected 
to Disks 


Separate servers have their 
own disks. Data is continu- 
ously copied from primary 
to secondary server. 


Servers are cabled to the 
same disks, but each server 
owns its disks. If one server 
fails, its disks are taken 
over by the other server. 


High availability. 


Reduced network and 

server overhead due to 
elimination of copying 

operations. 


High network and server 
overhead due to copying 
operations. 


Usually requires disk 
mirroring or RAID 
technology to 
compensate for risk 
of disk failure. 


Servers Share 
Disks 


Multiple servers 
simultaneously share 
access to disks. 


Low network and server 
overhead. Reduced risk 
of downtime caused by 


Requires lock manager 
software. Usually used 
with disk mirroring or 


disk failure. 


RAID technology. 


if the two systems do not share common disks, then the standby provides a func- 
tional backup but has no access to the databases managed by the primary. 

The passive standby is generally not referred to as a cluster. The term cluster is 
reserved for multiple interconnected computers that are all actively doing process- 
ing while maintaining the image of a single system to the outside world. The term 
active secondary is often used in referring to this configuration. Three classifications 
of clustering can be identified: separate servers, shared nothing, and shared memory. 

In one approach to clustering, each computer is a separate server with its own 
disks and there are no disks shared between systems (Figure 17.10a). This arrange- 
ment provides high performance as well as high availability. In this case, some type 
of management or scheduling software is needed to assign incoming client requests 
to servers so that the load is balanced and high utilization is achieved. It is desirable 
to have a failover capability, which means that if a computer fails while executing an 
application, another computer in the cluster can pick up and complete the applica- 
tion. For this to happen, data must constantly be copied among systems so that each 
system has access to the current data of the other systems. The overhead of this data 
exchange ensures high availability at the cost of a performance penalty. 

To reduce the communications overhead, most clusters now consist of servers 
connected to common disks (Figure 17.10b). In one variation on this approach, called 
shared nothing, the common disks are partitioned into volumes, and each volume is 
owned by a single computer. If that computer fails, the cluster must be reconfigured 
so that some other computer has ownership of the volumes of the failed computer. 

It is also possible to have multiple computers share the same disks at the same 
time (called the shared disk approach), so that each computer has access to all of the 
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volumes on all of the disks. This approach requires the use of some type of locking 
facility to ensure that data can only be accessed by one computer at a time. 


Operating System Design Issues 


Full exploitation of a cluster hardware configuration requires some enhancements 
to a single-system operating system. 


FAILURE MANAGEMENT How failures are managed by a cluster depends on the 
clustering method used (Table 17.2). In general, two approaches can be taken to 
dealing with failures: highly available clusters and fault-tolerant clusters. A highly 
available cluster offers a high probability that all resources will be in service. If a 
failure occurs, such as a system goes down or a disk volume is lost, then the queries 
in progress are lost. Any lost query, if retried, will be serviced by a different com- 
puter in the cluster. However, the cluster operating system makes no guarantee 
about the state of partially executed transactions. This would need to be handled at 
the application level. 

A fault-tolerant cluster ensures that all resources are always available. This is 
achieved by the use of redundant shared disks and mechanisms for backing out un- 
committed transactions and committing completed transactions. 

The function of switching applications and data resources over from a failed 
system to an alternative system in the cluster is referred to as failover. A related 
function is the restoration of applications and data resources to the original system 
once it has been fixed; this is referred to as failback. Failback can be automated, but 
this is desirable only if the problem is truly fixed and unlikely to recur. If not, auto- 
matic failback can cause subsequently failed resources to bounce back and forth be- 
tween computers, resulting in performance and recovery problems. 


LOAD BALANCING A cluster requires an effective capability for balancing the load 
among available computers. This includes the requirement that the cluster be incre- 
mentally scalable. When a new computer is added to the cluster, the load-balancing 
facility should automatically include this computer in scheduling applications. Mid- 
dleware mechanisms need to recognize that services can appear on different mem- 
bers of the cluster and may migrate from one member to another. 


PARALLELIZING COMPUTATION In some cases, effective use of a cluster requires 
executing software from a single application in parallel. [KAPPO00] lists three gen- 
eral approaches to the problem: 


e Parallelizing compiler: A parallelizing compiler determines, at compile time, 
which parts of an application can be executed in parallel. These are then split 
off to be assigned to different computers in the cluster. Performance depends 
on the nature of the problem and how well the compiler is designed. In gen- 
eral, such compilers are difficult to develop. 


e Parallelized application: In this approach, the programmer writes the appli- 
cation from the outset to run on a cluster, and uses message passing to 
move data, as required, between cluster nodes. This places a high burden 
on the programmer but may be the best approach for exploiting clusters 
for some applications. 
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e Parametric computing: This approach can be used if the essence of the appli- 
cation is an algorithm or program that must be executed a large number of 
times, each time with a different set of starting conditions or parameters. A 
good example is a simulation model, which will run a large number of different 
scenarios and then develop statistical summaries of the results. For this ap- 
proach to be effective, parametric processing tools are needed to organize, 
run, and manage the jobs in an effective manner. 


Cluster Computer Architecture 


Figure 17.11 shows a typical cluster architecture. The individual computers are con- 
nected by some high-speed LAN or switch hardware. Each computer is capable of 
operating independently. In addition, a middleware layer of software is installed in 
each computer to enable cluster operation. The cluster middleware provides a uni- 
fied system image to the user, known as a single-system image. The middleware is 
also responsible for providing high availability, by means of load balancing and re- 
sponding to failures in individual components. [HWAN99] lists the following as de- 
sirable cluster middleware services and functions: 


e Single entry point: A user logs onto the cluster rather than to an individual 
computer. 

e Single file hierarchy: The user sees a single hierarchy of file directories under 
the same root directory. 

e Single control point: There is a default workstation used for cluster manage- 
ment and control. 

e Single virtual networking: Any node can access any other point in the cluster, 
even though the actual cluster configuration may consist of multiple intercon- 
nected networks. There is a single virtual network operation. 

e Single memory space: Distributed shared memory enables programs to share 
variables. 

e Single job-management system: Under a cluster job scheduler, a user can sub- 
mit a job without specifying the host computer to execute the job. 

e Single user interface: A common graphic interface supports all users, regard- 
less of the workstation from which they enter the cluster. 

e Single I/O space: Any node can remotely access any I/O peripheral or disk de- 
vice without knowledge of its physical location. 

e Single process space: A uniform process-identification scheme is used. A 
process on any node can create or communicate with any other process on a 
remote node. 

e Checkpointing: This function periodically saves the process state and interme- 
diate computing results, to allow rollback recovery after a failure. 


e Process migration: This function enables load balancing. 


The last four items on the preceding list enhance the availability of the cluster. 
The remaining items are concerned with providing a single system image. 
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Figure 17.12 Example 100-Gbps Ethernet Configuration for 
Massive Blade Server Site 


Returning to Figure 17.11, a cluster will also include software tools for en- 
abling the efficient execution of programs that are capable of parallel execution. 


Blade Servers 


A common implementation of the cluster approach is the blade server. A blade 
server is a server architecture that houses multiple server modules (“blades”) in a 
single chassis. It is widely used in data centers to save space and improve system 
management. Either self-standing or rack mounted, the chassis provides the power 
supply, and each blade has its own processor, memory, and hard disk. 

An example of the application is shown in Figure 17.12, taken from [NOWE07]. 
The trend at large data centers, with substantial banks of blade servers, is the deploy- 
ment of 10-Gbps ports on individual servers to handle the massive multimedia traffic 
provided by these servers. Such arrangements are stressing the on-site Ethernet 
switches needed to interconnect large numbers of servers. A 100-Gbps rate provides 
the bandwidth required to handle the increased traffic load. The 100-Gbps Ethernet 
switches are deployed in switch uplinks inside the data center as well as providing 
interbuilding, intercampus, wide area connections for enterprise networks. 


Clusters Compared to SMP 


Both clusters and symmetric multiprocessors provide a configuration with multiple 
processors to support high-demand applications. Both solutions are commercially 
available, although SMP schemes have been around far longer. 

The main strength of the SMP approach is that an SMP is easier to manage 
and configure than a cluster. The SMP is much closer to the original single-processor 
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model for which nearly all applications are written. The principal change required in 
going from a uniprocessor to an SMP is to the scheduler function. Another benefit 
of the SMP is that it usually takes up less physical space and draws less power than 
a comparable cluster. A final important benefit is that the SMP products are well es- 
tablished and stable. 

Over the long run, however, the advantages of the cluster approach are likely 
to result in clusters dominating the high-performance server market. Clusters are 
far superior to SMPs in terms of incremental and absolute scalability. Clusters are 
also superior in terms of availability, because all components of the system can read- 
ily be made highly redundant. 


17.6 NONUNIFORM MEMORY ACCESS 


In terms of commercial products, the two common approaches to providing a 
multiple-processor system to support applications are SMPs and clusters. For some 
years, another approach, known as nonuniform memory access (NUMA), has been 
the subject of research and commercial NUMA products are now available. 

Before proceeding, we should define some terms often found in the NUMA 
literature. 


e Uniform memory access (UMA): All processors have access to all parts of 
main memory using loads and stores. The memory access time of a processor 
to all regions of memory is the same. The access times experienced by different 
processors are the same. The SMP organization discussed in Sections 17.2 and 
17.3 is UMA. 


e Nonuniform memory access (NUMA): All processors have access to all parts 
of main memory using loads and stores. The memory access time of a proces- 
sor differs depending on which region of main memory is accessed. The last 
statement is true for all processors; however, for different processors, which 
memory regions are slower and which are faster differ. 


e Cache-coherent NUMA (CC-NUMA): A NUMA system in which cache co- 
herence is maintained among the caches of the various processors. 


A NUMA system without cache coherence is more or less equivalent to a cluster. 
The commercial products that have received much attention recently are CC-NUMA 
systems, which are quite distinct from both SMPs and clusters. Usually, but unfortu- 
nately not always, such systems are in fact referred to in the commercial literature as 
CC-NUMA systems. This section is concerned only with CC-NUMA systems. 


Motivation 


With an SMP system, there is a practical limit to the number of processors that can 
be used. An effective cache scheme reduces the bus traffic between any one proces- 
sor and main memory. As the number of processors increases, this bus traffic also in- 
creases. Also, the bus is used to exchange cache-coherence signals, further adding to 
the burden. At some point, the bus becomes a performance bottleneck. Perfor- 
mance degradation seems to limit the number of processors in an SMP configuration 
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to somewhere between 16 and 64 processors. For example, Silicon Graphics’ Power 
Challenge SMP is limited to 64 R10000 processors in a single system; beyond this 
number performance degrades substantially. 

The processor limit in an SMP is one of the driving motivations behind the de- 
velopment of cluster systems. However, with a cluster, each node has its own private 
main memory; applications do not see a large global memory. In effect, coherency is 
maintained in software rather than hardware. This memory granularity affects perfor- 
mance and, to achieve maximum performance, software must be tailored to this envi- 
ronment. One approach to achieving large-scale multiprocessing while retaining the 
flavor of SMP is NUMA. For example, the Silicon Graphics Origin NUMA system is 
designed to support up to 1024 MIPS R10000 processors [WHIT97] and the Sequent 
NUMA-Q system is designed to support up to 252 Pentium II processors [LOVE96]. 

The objective with NUMA is to maintain a transparent system wide memory 
while permitting multiple multiprocessor nodes, each with its own bus or other 
internal interconnect system. 


Organization 


Figure 17.13 depicts a typical CC-NUMA organization. There are multiple indepen- 
dent nodes, each of which is, in effect, an SMP organization. Thus, each node con- 
tains multiple processors, each with its own L1 and L2 caches, plus main memory. 
The node is the basic building block of the overall CC-NUMA organization. For 
example, each Silicon Graphics Origin node includes two MIPS R10000 processors; 
each Sequent NUMA-Q node includes four Pentium II processors. The nodes are 
interconnected by means of some communications facility, which could be a switch- 
ing mechanism, a ring, or some other networking facility. 

Each node in the CC-NUMA system includes some main memory. From the 
point of view of the processors, however, there is only a single addressable memory, 
with each location having a unique system wide address. When a processor initiates 
a memory access, if the requested memory location is not in that processor’s cache, 
then the L2 cache initiates a fetch operation. If the desired line is in the local portion 
of the main memory, the line is fetched across the local bus. If the desired line is in a 
remote portion of the main memory, then an automatic request is sent out to fetch 
that line across the interconnection network, deliver it to the local bus, and then 
deliver it to the requesting cache on that bus. All of this activity is automatic and 
transparent to the processor and its cache. 

In this configuration, cache coherence is a central concern. Although imple- 
mentations differ as to details, in general terms we can say that each node must 
maintain some sort of directory that gives it an indication of the location of various 
portions of memory and also cache status information. To see how this scheme 
works, we give an example taken from [PFIS98]. Suppose that processor 3 on node 
2 (P2-3) requests a memory location 798, which is in the memory of node 1. The fol- 
lowing sequence occurs: 


1. P2-3 issues a read request on the snoopy bus of node 2 for location 798. 


2. The directory on node 2 sees the request and recognizes that the location is in 
node 1. 
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Figure 17.13 CC-NUMA Organization 


3. Node 2’s directory sends a request to node 1, which is picked up by node 1’s 
directory. 


4. Node 1’s directory, acting as a surrogate of P2-3, requests the contents of 798, 
as if it were a processor. 


5. Node 1’s main memory responds by putting the requested data on the bus. 
6. Node 1’s directory picks up the data from the bus. 
7. The value is transferred back to node 2’s directory. 


8. Node 2’s directory places the data back on node 2’s bus, acting as a surrogate 
for the memory that originally held it. 


9. The value is picked up and placed in P2-3’s cache and delivered to P2-3. 
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The preceding sequence explains how data are read from a remote memory 
using hardware mechanisms that make the transaction transparent to the processor. 
On top of these mechanisms, some form of cache coherence protocol is needed. Var- 
ious systems differ on exactly how this is done. We make only a few general remarks 
here. First, as part of the preceding sequence, node 1’s directory keeps a record that 
some remote cache has a copy of the line containing location 798. Then, there needs 
to be a cooperative protocol to take care of modifications. For example, if a modifi- 
cation is done in a cache, this fact can be broadcast to other nodes. Each node’s di- 
rectory that receives such a broadcast can then determine if any local cache has that 
line and, if so, cause it to be purged. If the actual memory location is at the node re- 
ceiving the broadcast notification, then that node’s directory needs to maintain an 
entry indicating that that line of memory is invalid and remains so until a write back 
occurs. If another processor (local or remote) requests the invalid line, then the local 
directory must force a write back to update memory before providing the data. 


NUMA Pros and Cons 


The main advantage of a CC-NUMA system is that it can deliver effective perfor- 
mance at higher levels of parallelism than SMP, without requiring major software 
changes. With multiple NUMA nodes, the bus traffic on any individual node is lim- 
ited to a demand that the bus can handle. However, if many of the memory accesses 
are to remote nodes, performance begins to break down. There is reason to believe 
that this performance breakdown can be avoided. First, the use of L1 and L2 caches 
is designed to minimize all memory accesses, including remote ones. If much of the 
software has good temporal locality, then remote memory accesses should not be 
excessive. Second, if the software has good spatial locality, and if virtual memory is 
in use, then the data needed for an application will reside on a limited number of 
frequently used pages that can be initially loaded into the memory local to the run- 
ning application. The Sequent designers report that such spatial locality does appear 
in representative applications [LOVE96]. Finally, the virtual memory scheme can be 
enhanced by including in the operating system a page migration mechanism that 
will move a virtual memory page to a node that is frequently using it; the Silicon 
Graphics designers report success with this approach [WHIT97]. 

Even if the performance breakdown due to remote access is addressed, there 
are two other disadvantages for the CC-NUMA approach. Two in particular are dis- 
cussed in detail in [PFIS98]. First,a CC-NUMA does not transparently look like an 
SMP; software changes will be required to move an operating system and applica- 
tions from an SMP to a CC-NUMA system. These include page allocation, already 
mentioned, process allocation, and load balancing by the operating system. A 
second concern is that of availability. This is a rather complex issue and depends 
on the exact implementation of the CC-NUMA system; the interested reader is 
referred to [PFIS98]. 
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17.7 VECTOR COMPUTATION 


Although the performance of mainframe general-purpose computers continues to 
improve relentlessly, there continue to be applications that are beyond the reach of 
the contemporary mainframe. There is a need for computers to solve mathematical 
problems of physical processes, such as occur in disciplines including aerodynamics, 
seismology, meteorology, and atomic, nuclear, and plasma physics. 

Typically, these problems are characterized by the need for high precision and 
a program that repetitively performs floating-point arithmetic operations on 
large arrays of numbers. Most of these problems fall into the category known as 
continuous-field simulation. In essence, a physical situation can be described by a 
surface or region in three dimensions (e.g., the flow of air adjacent to the surface of 
a rocket). This surface is approximated by a grid of points. A set of differential equa- 
tions defines the physical behavior of the surface at each point. The equations are 
represented as an array of values and coefficients, and the solution involves re- 
peated arithmetic operations on the arrays of data. 

Supercomputers were developed to handle these types of problems. These ma- 
chines are typically capable of billions of floating-point operations per second. In 
contrast to mainframes, which are designed for multiprogramming and intensive I/O, 
the supercomputer is optimized for the type of numerical calculation just described. 

The supercomputer has limited use and, because of its price tag, a limited mar- 
ket. Comparatively few of these machines are operational, mostly at research 
centers and some government agencies with scientific or engineering functions. As 
with other areas of computer technology, there is a constant demand to increase the 
performance of the supercomputer. Thus, the technology and performance of the 
supercomputer continues to evolve. 

There is another type of system that has been designed to address the need for 
vector computation, referred to as the array processor. Although a supercomputer is 
optimized for vector computation, it is a general-purpose computer, capable of han- 
dling scalar processing and general data processing tasks. Array processors do not 
include scalar processing; they are configured as peripheral devices by both main- 
frame and minicomputer users to run the vectorized portions of programs. 


Approaches to Vector Computation 


The key to the design of a supercomputer or array processor is to recognize that the 
main task is to perform arithmetic operations on arrays or vectors of floating-point 
numbers. In a general-purpose computer, this will require iteration through each 
element of the array. For example, consider two vectors (one-dimensional arrays) of 
numbers, A and B. We would like to add these and place the result in C. In the ex- 
ample of Figure 17.14, this requires six separate additions. How could we speed up 
this computation? The answer is to introduce some form of parallelism. 

Several approaches have been taken to achieving parallelism in vector computa- 
tion. We illustrate this with an example. Consider the vector multiplication C = A X B, 
where A, B, and C are N X Nmatrices. The formula for each element of C is 


N 
Cij = Sai x x Dy; 
= 
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1.5 2.0 3.5 
7.1 39.7 46.8 
6.9 ais 1000.003 | _ | 1006.093 
100.5 11 111.5 
0 21.1 21.1 
59.7 19.7 79.4 
A + B = Cc 


Figure 17.14 Example of Vector Addition 


where A, B, and C have elements a;;, b;;, and c;;, respectively. Figure 17.15a 
shows a FORTRAN program for this computation that can be run on an ordi- 
nary scalar processor. 

One approach to improving performance can be referred to as vector process- 
ing. This assumes that it is possible to operate on a one-dimensional vector of data. 
Figure 17.15b is a FORTRAN program with a new form of instruction that allows 


DO 1001 = 1,N 

DO 100J = 1,N 

C(I, J) = 0.0 

DO 100K = 1,N 

C(I, J) = Cd, J) + AC, K) + B(K, J) 
CONTINUE 


100 


(a) Scalar processing 


DO 1001 = 1,N 

Cd, J) = 0.0 J = 1, N) 

DO 100K = 1,N 

C(I, J) = Cd, J) + AC, K) + B(K, J) J = 1, N) 


100 CONTINUE 


(b) Vector processing 


DO50J=1,N-—1 
FORK 100 
50 CONTINUE 
J=N 
DO 2001 = 1,N 
C(I, J) = 0.0 
DO 200 K = 1,N 
C(I, J) = Cd, J) + AC, K) + BK, J) 
CONTINUE 


100 


200 


(c) Parallel processing 


Figure 17.15 Matrix Multiplication (C = A X B) 
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vector computation to be specified. The notation (J = 1, N) indicates that opera- 
tions on all indices J in the given interval are to be carried out as a single operation. 
How this can be achieved is addressed shortly. 

The program in Figure 17.15b indicates that all the elements of the ith row are 
to be computed in parallel. Each element in the row is a summation, and the sum- 
mations (across K) are done serially rather than in parallel. Even so, only N’ vector 
multiplications are required for this algorithm as compared with N° scalar multipli- 
cations for the scalar algorithm. 

Another approach, parallel processing, is illustrated in Figure 17.15c. This ap- 
proach assumes that we have N independent processors that can function in paral- 
lel. To utilize processors effectively, we must somehow parcel out the computation 
to the various processors. Two primitives are used. The primitive FORK n causes an 
independent process to be started at location n. In the meantime, the original 
process continues execution at the instruction immediately following the FORK. 
Every execution of a FORK spawns a new process. The JOIN instruction is essen- 
tially the inverse of the FORK. The statement JOIN N causes N independent 
processes to be merged into one that continues execution at the instruction follow- 
ing the JOIN. The operating system must coordinate this merger, and so the execu- 
tion does not continue until all N processes have reached the JOIN instruction. 

The program in Figure 17.15c is written to mimic the behavior of the vector- 
processing program. In the parallel processing program, each column of C is com- 
puted by a separate process. Thus, the elements in a given row of C are computed 
in parallel. 

The preceding discussion describes approaches to vector computation in logi- 
cal or architectural terms. Let us turn now to a consideration of types of processor 
organization that can be used to implement these approaches. A wide variety of 
organizations have been and are being pursued. Three main categories stand out: 


e Pipelined ALU 
e Parallel ALUs 


e Parallel processors 


Figure 17.16 illustrates the first two of these approaches. We have already dis- 
cussed pipelining in Chapter 12. Here the concept is extended to the operation of 
the ALU. Because floating-point operations are rather complex, there is opportu- 
nity for decomposing a floating-point operation into stages, so that different stages 
can operate on different sets of data concurrently. This is illustrated in Figure 17.17a. 
Floating-point addition is broken up into four stages (see Figure 9.22): compare, 
shift, add, and normalize. A vector of numbers is presented sequentially to the first 
stage. As the processing proceeds, four different sets of numbers will be operated on 
concurrently in the pipeline. 

It should be clear that this organization is suitable for vector processing. To see 
this, consider the instruction pipelining described in Chapter 12. The processor goes 
through a repetitive cycle of fetching and processing instructions. In the absence of 
branches, the processor is continuously fetching instructions from sequential loca- 
tions. Consequently, the pipeline is kept full and a savings in time is achieved. Simi- 
larly, a pipelined ALU will save time only if it is fed a stream of data from sequential 
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Input 
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(a) Pipelined ALU 


Input 
registers 


a 


Memory Output 
register 


| 


(b) Parallel ALUs 


Figure 17.16 Approaches to Vector Computation 


locations. A single, isolated floating-point operation is not speeded up by a pipeline. 
The speedup is achieved when a vector of operands is presented to the ALU. The 
control unit cycles the data through the ALU until the entire vector is processed. 

The pipeline operation can be further enhanced if the vector elements are 
available in registers rather than from main memory. This is in fact suggested by 
Figure 17.16a. The elements of each vector operand are loaded as a block into a vec- 
tor register, which is simply a large bank of identical registers. The result is also 
placed in a vector register. Thus, most operations involve only the use of registers, 
and only load and store operations and the beginning and end of a vector operation 
require access to memory. 

The mechanism illustrated in Figure 17.17 could be referred to as pipelining 
within an operation. That is, we have a single arithmetic operation (e.g.,C = A + B) 
that is to be applied to vector operands, and pipelining allows multiple vector ele- 
ments to be processed in parallel. This mechanism can be augmented with pipelining 
across operations. In this latter case, there is a sequence of arithmetic vector opera- 
tions, and instruction pipelining is used to speed up processing. One approach to 
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Cc S A N 
Xx;—»| Compare Shift Add ; 
Yi—>j| exponent >| significand >| significands >| Normalize ->> 4 


X2Y2—>|C|S 

X39 Y3—>| C 
X4,Y4—>)| C| S| A] N | %4 

X5)¥5—>|C}| S| Al N | %5 


Ji+3 : 
A 
N) 


*p)¥1 —|C| S| A} Np %1 
(a) Pipelined ALU xy y2 »!ElslAl Nee 
*3, 93 —>|C| S| A} NP 23 
X4,Y4 —>| C] S| A| NEO %4 
X55 Y5 C|S|A| NPR %s 
X69 Y6 C|S|A|NP>%o 
X77 C|S|A|Ne}% 
Xs Vg C|S|A|Np%s 
X9,¥9 —>| C|S|A|N Z9 
X10 Y10 —>| C| S| AI N 210 
p11 —>|C|S|A|N Zu 
X12912 >| C] S|A|N 212 
(b) Four parallel ALUs 


Figure 17.17 Pipelined Processing of Floating-Point Operations 


this, referred to as chaining, is found on the Cray supercomputers. The basic rule for 
chaining is this: A vector operation may start as soon as the first element of the 
operand vector(s) is available and the functional unit (e.g., add, subtract, multiply, 
divide) is free. Essentially, chaining causes results issuing from one functional unit to 
be fed immediately into another functional unit and so on. If vector registers are 
used, intermediate results do not have to be stored into memory and can be used 
even before the vector operation that created them runs to completion. 

For example, when computing C = (s X A) + B, where A, B, and C are vec- 
tors and s is a scalar, the Cray may execute three instructions at once. Elements 
fetched for a load immediately enter a pipelined multiplier, the products are sent to 
a pipelined adder, and the sums are placed in a vector register as soon as the adder 
completes them: 


1. Vector load A-— Vector Register (VR1) 
2. Vector load B— VR2 
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3. Vector multiply s X VR1—VR3 
4. Vector add VR3 + VR2— VR4 
5. Vector store VR4—-C 


Instructions 2 and 3 can be chained (pipelined) because they involve different mem- 
ory locations and registers. Instruction 4 needs the results of instructions 2 and 3, but 
it can be chained with them as well. As soon as the first elements of vector registers 
2 and 3 are available, the operation in instruction 4 can begin. 

Another way to achieve vector processing is by the use of multiple ALUs in a 
single processor, under the control of a single control unit. In this case, the control 
unit routes data to ALUs so that they can function in parallel. It is also possible to 
use pipelining on each of the parallel ALUs. This is illustrated in Figure 17.17b. The 
example shows a case in which four ALUs operate in parallel. 

As with pipelined organization, a parallel ALU organization is suitable for 
vector processing. The control unit routes vector elements to ALUs in a round-robin 
fashion until all elements are processed. This type of organization is more complex 
than a single-ALU CPI. 

Finally, vector processing can be achieved by using multiple parallel proces- 
sors. In this case, it is necessary to break the task up into multiple processes to be 
executed in parallel. This organization is effective only if the software and hardware 
for effective coordination of parallel processors is available. 

We can expand our taxonomy of Section 17.1 to reflect these new structures, as 
shown in Figure 17.18. Computer organizations can be distinguished by the pres- 
ence of one or more control units. Multiple control units imply multiple processors. 
Following our previous discussion, if the multiple processors can function coopera- 
tively on a given task, they are termed parallel processors. 

The reader should be aware of some unfortunate terminology likely to be en- 
countered in the literature. The term vector processor is often equated with a 
pipelined ALU organization, although a parallel ALU organization is also designed 
for vector processing, and, as we have discussed, a parallel processor organization 
may also be designed for vector processing. Array processing is sometimes used to 
refer to a parallel ALU, although, again, any of the three organizations is optimized 
for the processing of arrays. To make matters worse, array processor usually refers to 
an auxiliary processor attached to a general-purpose processor and used to perform 
vector computation. An array processor may use either the pipelined or parallel 
ALU approach. 

At present, the pipelined ALU organization dominates the marketplace. 
Pipelined systems are less complex than the other two approaches. Their control 


Single control unit Multiple control unit 
Uniprocessor Pipelined ALU Parallel ALUs Multiprocessor Parallel processors 


Figure 17.18 A Taxonomy of Computer Organizations 


670 CHAPTER 17 / PARALLEL PROCESSING 


unit and operating system design are well developed to achieve efficient resource 
allocation and high performance. The remainder of this section is devoted to a more 
detailed examination of this approach, using a specific example. 


IBM 3090 Vector Facility 


A good example of a pipelined ALU organization for vector processing is the vector 
facility developed for the IBM 370 architecture and implemented on the high-end 
3090 series [PADE88, TUCK87]. This facility is an optional add-on to the basic sys- 
tem but is highly integrated with it. It resembles vector facilities found on super- 
computers, such as the Cray family. 

The IBM facility makes use of a number of vector registers. Each register is ac- 
tually a bank of scalar registers. To compute the vector sum C = A + B, the vectors 
A and B are loaded into two vector registers. The data from these registers are passed 
through the ALU as fast as possible, and the results are stored in a third vector regis- 
ter. The computation overlap, and the loading of the input data into the registers in a 
block, results in a significant speeding up over an ordinary ALU operation. 


ORGANIZATION The IBM vector architecture, and similar pipelined vector ALUs, 
provides increased performance over loops of scalar arithmetic instructions in 
three ways: 


e The fixed and predetermined structure of vector data permits housekeeping 
instructions inside the loop to be replaced by faster internal (hardware or mi- 
crocoded) machine operations. 


e Data-access and arithmetic operations on several successive vector elements 
can proceed concurrently by overlapping such operations in a pipelined design 
or by performing multiple-element operations in parallel. 


e The use of vector registers for intermediate results avoids additional stor- 
age reference. 


Figure 17.19 shows the general organization of the vector facility. Although the 
vector facility is seen to be a physically separate add-on to the processor, its architec- 
ture is an extension of the System/370 architecture and is compatible with it. The vec- 
tor facility is integrated into the System/370 architecture in the following ways: 


e Existing System/370 instructions are used for all scalar operations. 


e Arithmetic operations on individual vector elements produce exactly the same 
result as do corresponding System/370 scalar instructions. For example, one 
design decision concerned the definition of the result in a floating-point 
DIVIDE operation. Should the result be exact, as it is for scalar floating-point 
division, or should an approximation be allowed that would permit higher- 
speed implementation but could sometimes introduce an error in one or more 
low-order bit positions? The decision was made to uphold complete compati- 
bility with the System/370 architecture at the expense of a minor performance 
degradation. 


e Vector instructions are interruptible, and their execution can be resumed from 
the point of interruption after appropriate action has been taken, in a manner 
compatible with the System/370 program-interruption scheme. 
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Figure 17.19 IBM 3090 with Vector Facility 


e Arithmetic exceptions are the same as, or extensions of, exceptions for the 
scalar arithmetic instructions of the System/370, and similar fix-up routines 
can be used. To accommodate this, a vector interruption index is employed 
that indicates the location in a vector register that is affected by an exception 
(e.g., overflow). Thus, when execution of the vector instruction resumes, the 
proper place in a vector register is accessed. 


e Vector data reside in virtual storage, with page faults being handled in a stan- 
dard manner. 


This level of integration provides a number of benefits. Existing operating sys- 
tems can support the vector facility with minor extensions. Existing application pro- 
grams, language compilers, and other software can be run unchanged. Software that 
could take advantage of the vector facility can be modified as desired. 


REGISTERS A key issue in the design of a vector facility is whether operands are 
located in registers or memory. The IBM organization is referred to as register to 
register, because the vector operands, both input and output, can be staged in vec- 
tor registers. This approach is also used on the Cray supercomputer. An alterna- 
tive approach, used on Control Data machines, is to obtain operands directly 
from memory. The main disadvantage of the use of vector registers is that the 
programmer or compiler must take them into account for good performance. For 
example, suppose that the length of the vector registers is K and the length of the 
vectors to be processed is N > K. In this case, a vector loop must be performed, 
in which the operation is performed on K elements at a time and the loop is re- 
peated N/K times. The main advantage of the vector register approach is that the 
operation is decoupled from slower main memory and instead takes place pri- 
marily with registers. 
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FORTRAN ROUTINE: 


DO 100 J = 1,50 
CRG) = AR(J) * BRV) — AI(J) * BIW) 
100 CJ) = ARV) * BIG) + AI(J) * BRVJ) 


Operation Cycles Operation Cycles 

AR(J) * BRV) > T1() 3 ARG) 7 VI) 1 

Al) * BIG) => T2() 3 BR(J) > V2(J) 1 

T1Q) — T2(J) > CR) 3 V1) * V2) > V3) l 

AR(J) * BI) > T3() 3 AI) > VA) ! 

AI(J) * BRI) > T40) 3 BIJ) > V5(J) 

T3) + T4J) > CI) V4(J) * V5) > V6() 1 
V3(J) — V6(J) > V7) 1 

TOTAL 18 Vi(J) —> CRG) 1 
V1(J) * V5) > V8) 1 

(a) Storage to storage VA) * V2) > V9) 1 
V8(J) + V9(I) = VOU) 1 
Operation Cycles Vo) > CI) 

AR(J) > V1) I TOTAL Je 

V1(J) * BRJ V2 1 P 2 

Al a Q) & eR (b) Register to register 

V3(J) * BIJ) — V4(J) ey 7.7 <a as 

V2) — V4() > V5() i Operation Cycles 

ae a) i ARG) +> V1) 1 

V1(J) * BIJ) — V6(J) 1 VI) * BRS) > V2) 1 

V4(J) * BRI) > V7(J) 1 AI() > V3() 1 

ee eens oy I V2) — V3) * BIG) > V20) ; 
V2(J) — CR(J) 1 
V1(J) * BIG) > V4) 

Oise 10 V4) + V3) * BRU) 3 V5) i 

(c) Storage to register VSO) = ON) 

Vi = Vector registers TOTAL 8 

AR, BR, AI, BI = Operands in memory ! i 

Ti = Temporary locations in memory (d) Compound instruction 


Figure 17.20 Alternative Programs for Vector Calculation 


The speedup that can be achieved using registers is demonstrated in Figure 
17.20. The FORTRAN routine multiplies vector A by vector B to produce vector 
C, where each vector has a real part (AR, BR, CR) and an imaginary part (AI, BI, 
Cl). The 3090 can perform one main-storage access per processor, or clock, cycle 
(either read or write); has registers that can sustain two accesses for reading and 
one for writing per cycle; and produces one result per cycle in its arithmetic unit. 
Let us assume the use of instructions that can specify two source operands and a 
result.’ Part a of the figure shows that, with memory-to-memory instructions, each 
iteration of the computation requires a total of 18 cycles. With a pure register-to- 


‘For the 370/390 architecture, the only three-operand instructions (register and storage instructions, RS) 
specify two operands in registers and one in memory. In part a of the example, we assume the existence 
of three-operand instructions in which all operands are in main memory. This is done for purposes of 
comparison and, in fact, such an instruction format could have been chosen for the vector architecture. 
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Figure 17.21 Registers for the IBM 3090 Vector Facility 


register architecture (part b), this time is reduced to 12 cycles. Of course, with 
register-to-register operation, the vector quantities must be loaded into the vec- 
tor registers prior to computation and stored in memory afterward. For large vec- 
tors, this fixed penalty is relatively small. Figure 17.20c shows that the ability to 
specify both storage and register operands in one instruction further reduces the 
time to 10 cycles per iteration. This latter type of instruction is included in the 
vector architecture.” 

Figure 17.21 illustrates the registers that are part of the IBM 3090 vector facil- 
ity. There are sixteen 32-bit vector registers. The vector registers can also be coupled 
to form eight 64-bit vector registers. Any register element can hold an integer or 
floating-point value. Thus, the vector registers may be used for 32-bit and 64-bit 
integer values, and 32-bit and 64-bit floating-point values. 


>Compound instructions, discussed subsequently, afford a further reduction. 
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The architecture specifies that each register contains from 8 to 512 scalar 
elements. The choice of actual length involves a design trade-off. The time to doa 
vector operation consists essentially of the overhead for pipeline startup and reg- 
ister filling plus one cycle per vector element. Thus, the use of a large number of 
register elements reduces the relative startup time for a computation. However, 
this efficiency must be balanced against the added time required for saving and 
restoring vector registers on a process switch and the practical cost and space lim- 
its. These considerations led to the use of 128 elements per register in the current 
3090 implementation. 

Three additional registers are needed by the vector facility. The vector-mask 
register contains mask bits that may be used to select which elements in the vector 
registers are to be processed for a particular operation. The vector-status register 
contains control fields, such as the vector count, that determine how many elements 
in the vector registers are to be processed. The vector-activity count keeps track of 
the time spent executing vector instructions. 


COMPOUND INSTRUCTIONS As was discussed previously, instruction execution 
can be overlapped using chaining to improve performance. The designers of the 
IBM vector facility chose not to include this capability for several reasons. The Sys- 
tem/370 architecture would have to be extended to handle complex interruptions 
(including their effect on virtual memory management), and corresponding 
changes would be needed in the software. A more basic issue was the cost of in- 
cluding the additional controls and register access paths in the vector facility for 
generalized chaining. 

Instead, three operations are provided that combine into one instruction (one 
opcode) the most common sequences in vector computation, namely multiplication 
followed by addition, subtraction, or summation. The storage-to-register MULTIPLY- 
AND-ADD instruction, for example, fetches a vector from storage, multiplies it by 
a vector from a register, and adds the product to a third vector in a register. By use 
of the compound instructions MULTIPLY-AND-ADD and MULTIPLY-AND- 
SUBTRACT in the example of Figure 17.20, the total time for the iteration is 
reduced from 10 to 8 cycles. 

Unlike chaining, compound instructions do not require the use of additional 
registers for temporary storage of intermediate results, and they require one less 
register access. For example, consider the following chain: 


A—VRI 
VRI + VR2—VRI1 


In this case, two stores to the vector register VR1 are required. In the IBM architec- 
ture there is a storage-to-register ADD instruction. With this instruction, only the 
sum is placed in VR1. The compound instruction also avoids the need to reflect in 
the machine-state description the concurrent execution of a number of instructions, 
which simplifies status saving and restoring by the operating system and the han- 
dling of interrupts. 


THE INSTRUCTION SET Table 17.3 summarizes the arithmetic and logical opera- 
tions that are defined for the vector architecture. In addition, there are memory-to- 


¢L9 


fable 17.3 IBM 3090 Vector Facility: Arithmetic and Logical Instructions 


Data Types 


Floating-Point 


Operation Long Short Binary or Logical Operand Locations 


Add FL FS V+S—-V Q+V>V 
Subtract FL FS V-S-V Q-V-Vv 
Multiply FL FS VxV-V QxV-V 
Divide FL FS V/S-V Q/V-V 
Compare FL FS V:-S-V Q:V—-V 
Multiply and Add FL FS Vt+tVxS-V V+QxV->V 
Multiply and Subtract FL FS V-VxS-V V-QxV-V 
Multiply and Accumulate FL FS P+:S>V 

Complement FL FS 

Positive Absolute FL FS 

Negative Absolute FL FS 

Maximum FL FS 

Maximum Absolute FL FS 

Minimum 

Shift Left Logical 

Shift Right Logical 

And 

OR 

Exclusive-OR 


Explanation: Data Types Operand Locations 
FL Long floating point V_ Vector register 
FS Short floating point S Storage 
BI Binary integer Q Scalar (general or floating-point register) 
LO Logical P Partial sums in vector register 


Special operation 
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register load and register-to-memory store instructions. Note that many of the in- 
structions use a three-operand format. Also, many instructions have a number of 
variants, depending on the location of the operands. A source operand may be a 
vector register (V), storage (S), or a scalar register (Q). The target is always a vec- 
tor register, except for comparison, the result of which goes into the vector-mask 
register. With all these variants, the total number of opcodes (distinct instructions) 
is 171. This rather large number, however, is not as expensive to implement as 
might be imagined. Once the machine provides the arithmetic units and the data 
paths to feed operands from storage, scalar registers, and vector registers to the 
vector pipelines, the major hardware cost has been incurred. The architecture can, 
with little difference in cost, provide a rich set of variants on the use of those regis- 
ters and pipelines. 

Most of the instructions in Table 17.3 are self-explanatory. The two summa- 
tion instructions warrant further explanation. The accumulate operation adds to- 
gether the elements of a single vector (ACCUMULATE) or the elements of the 
product of two vectors (MULTIPLY-AND-ACCUMULATE). These instructions 
present an interesting design problem. We would like to perform this operation as 
rapidly as possible, taking full advantage of the ALU pipeline. The difficulty is that 
the sum of two numbers put into the pipeline is not available until several cycles 
later. Thus, the third element in the vector cannot be added to the sum of the first 
two elements until those two elements have gone through the entire pipeline. To 
overcome this problem, the elements of the vector are added in such a way as to 
produce four partial sums. In particular, elements 0, 4, 8, 12, ..., 124 are added in 
that order to produce partial sum 0; elements 1,5, 9,13, ..., 125 to partial sum 1; 
elements 2, 6, 10, 14,..., 126 to partial sum 2; and elements 3, 7, 11, 15,..., 127 to 
partial sum 4. Each of these partial sums can proceed through the pipeline at top 
speed, because the delay in the pipeline is roughly four cycles. A separate vector 
register is used to hold the partial sums. When all elements of the original vector 
have been processed, the four partial sums are added together to produce the final 
result. The performance of this second phase is not critical, because only four vec- 
tor elements are involved. 


17.8 RECOMMENDED READING AND WEB SITE 


[CATA94] surveys the principles of multiprocessors and examines SPARC-based SMPs in 
detail. SMPs are also covered in some detail in [STON93] and [HWAN93]. 

[MILE00] is an overview of cache coherence algorithms and techniques for multi- 
processors, with an emphasis on performance issues. Another survey of the issues relating to 
cache coherence in multiprocessors is [LILJ93]. [TOMA93] contains reprints of many of the 
key papers on the subject. 

[UNGE02] is an excellent survey of the concepts of multithreaded processors and chip 
multiprocessors. [UNGE03] is a lengthy survey of both proposed and current multithreaded 
processors that use explicit multithreading. 

A thorough treatment of clusters can be found in [BUYY99a] and [BUYY99b]. 
[WEYG01] is a less technical survey of clusters, with good commentary on various commer- 
cial products. [DESA05] describes IBM’s blade server architecture. 

Good discussions of vector computation can be found in [STON93] and [HWAN93]. 
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Recommended Web site: 


e IEEE Computer Society Task Force on Cluster Computing: An international forum to 
promote cluster computing research and education. 
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Key Terms 


active standby MESI protocol symmetric multiprocessor 
cache coherence multiprocessor (SMP) 
cluster nonuniform memory access uniform memory access 


directory protocol (NUMA) (UMA) 
failback passive standby uniprocessor 
failover snoopy protocol vector facility 
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Review Questions 


17.1 
17.2 
17.3 
17.4 
17.5 
17.6 
17.7 
17.8 
17.9 


List and briefly define three types of computer system organization. 

What are the chief characteristics of an SMP? 

What are some of the potential advantages of an SMP compared with a uniprocessor? 
What are some of the key OS design issues for an SMP? 

What is the difference between software and hardware cache coherent schemes? 
What is the meaning of each of the four states in the MESI protocol? 

What are some of the key benefits of clustering? 

What is the difference between failover and failback? 

What are the differences anong UMA, NUMA, and CC-NUMA? 


Problems 


17.1 


17.2 


17.5 


17.6 


Let a be the percentage of program code that can be executed simultaneously by 

n processors in a computer system. Assume that the remaining code must be exe- 

cuted sequentially by a single processor. Each processor has an execution rate of 

x MIPS. 

a. Derive an expression for the effective MIPS rate when using the system for exclu- 
sive execution of this program, in terms of n, a, and x. 

b. Ifnm = 16 and x = 4 MIPS, determine the value of a that will yield a system per- 
formance of 40 MIPS. 


A multiprocessor with eight processors has 20 attached tape drives. There are a 

large number of jobs submitted to the system that each require a maximum of four 

tape drives to complete execution. Assume that each job starts running with only 

three tape drives for a long period before requiring the fourth tape drive for a 

short period toward the end of its operation. Also assume an endless supply of 

such jobs. 

a. Assume the scheduler in the OS will not start a job unless there are four tape dri- 
ves available. When a job is started, four drives are assigned immediately and are 
not released until the job finishes. What is the maximum number of jobs that can 
be in progress at once? What are the maximum and minimum number of tape dri- 
ves that may be left idle as a result of this policy? 

b. Suggest an alternative policy to improve tape drive utilization and at the same 
time avoid system deadlock. What is the maximum number of jobs that can be in 
progress at once? What are the bounds on the number of idling tape drives? 

Can you foresee any problem with the write-once cache approach on bus-based mul- 

tiprocessors? If so, suggest a solution. 

Consider a situation in which two processors in an SMP configuration, over time, re- 

quire access to the same line of data from main memory. Both processors have a cache 

and use the MESI protocol. Initially, both caches have an invalid copy of the line. 

Figure 17.22 depicts the consequence of a read of line x by Processor P1. If this is the 

start of a sequence of accesses, draw the subsequent figures for the following sequence: 

1. P2 reads x. 

2. P1 writes to x (for clarity, label the line in P1’s cache x’). 

3. P1 writes to x (label the line in P1’s cache x”). 

4. P2 reads x. 

Figure 17.23 shows the state diagrams of two possible cache coherence protocols. 

Deduce and explain each protocol, and compare each to MESI. 

Consider an SMP with both L1 and L2 caches using the MESI protocol. As explained 

in Section 17.3, one of four states is associated with each line in the L2 cache. Are all 

four states also needed for each line in the L1 cache? If so, why? If not, explain which 
state or states can be eliminated. 
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Figure 17.22 MESI Example: Processor 1 Reads Line x 


17.7. An earlier version of the IBM mainframe, the S/390 G4, used three levels of cache. As 
with the z990, only the first level was on the processor chip [called the processor unit 
(PU)]. The L2 cache was also similar to the z990. An L3 cache was on a separate chip 
that acted as a memory controller, and was interposed between the L2 caches and the 
memory cards. Table 17.4 shows the performance of a three-level cache arrangement 
for the IBM S/390. The purpose of this problem is to determine whether the inclusion 
of the third level of cache seems worthwhile. Determine the access penalty (average 
number of PU cycles) for a system with only an L1 cache, and normalize that value to 
1.0. Then determine the normalized access penalty when both an L1 and L2 cache are 
used, and the access penalty when all three caches are used. Note the amount of 
improvement in each case and state your opinion on the value of the L3 cache. 


RW) 
Wj) Z(j) 


ZG) j j RG) 


W(i) = Write to line by processor i 

R(i) = Read line by processor i 

Z(i) = Displace line by cache i 

W(j) = Write to line by processor j (j # i) 
R(j) = Read line by processor j (j # i) 
Z(j) = Displace line by cache j (j # i) 


Note: State diagrams are for a 
given line in cache i 


wa 20) 


Figure 17.23 Two Cache Coherence Protocols 
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Table 17.4 Typical Cache Hit Rate on S/390 SMP Configuration [MAK97] 


Access Penalty 
Memory Subsystem (PU cycles) Cache Size Hit Rate (%) 


L1 cache 32 KB 


L2 cache 256 KB 


L3 cache 2 MB 


17.8 


17.9 
17.10 


17.11 


Memory 8 GB 


a. Consider a uniprocessor with separate data and instruction caches, with hit ratios of 
H, and H,, respectively. Access time from processor to cache is c clock cycles, and 
transfer time for a block between memory and cache is b clock cycles. Let f; be the 
fraction of memory accesses that are for instructions, and f, is the fraction of dirty 
lines in the data cache among lines replaced. Assume a write-back policy and deter- 
mine the effective memory access time in terms of the parameters just defined. 

b. Now assume a bus-based SMP in which each processor has the characteristics of 
part (a). Every processor must handle cache invalidation in addition to memory 
reads and writes. This affects effective memory access time. Let f,,,, be the fraction 
of data references that cause invalidation signals to be sent to other data caches. 
The processor sending the signal requires ¢ clock cycles to complete the invalida- 
tion operation. Other processors are not involved in the invalidation operation. 
Determine the effective memory access time. 

What organizational alternative is suggested by each of the illustrations in Figure 17.24? 

In Figure 17.8, some of the diagrams show horizontal rows that are partially filled. In 

other cases, there are rows that are completely blank. These represent two different 

types of loss of efficiency. Explain. 

Consider the pipeline depiction in Figure 12.13b, which is redrawn in Figure 17.25a, 

with the fetch and decode stages ignored, to represent the execution of thread A. Fig- 

ure 17.25b illustrates the execution of a separate thread B. In both cases, a simple 
pipelined processor is used. 

a. Show an instruction issue diagram, similar to Figure 17.8a, for each of the two threads. 

b. Assume that the two threads are to be executed in parallel on a chip multiproces- 
sor, with each of the two processors on the chip using a simple pipeline. Show an 


(a) (b) (c) (d) 


Figure 17.24 Diagram for Problem 18.9 


17.12 


d. 
e. 
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CO|FO| EI |wo 


1 1 | BI 
2 2 | B2| BI 
3 3 | B3 | B2| BI 
4 4 | B4 | B3 | B2 | BI 
5 5 B3 | B2 
e| 6 6 B3 
5| 7 7 | BS | B4 
8 8 | B6 | BS | B4 
9 9 | B7| Bo | BS | B4 
10 10 B7 | B6 | BS 
u u B7 | B6 
12 12 B7 
(b) 


Figure 17.25 Two Threads of Execution 


instruction issue diagram similar to Figure 17.8k. Also show a pipeline execution 
diagram in the style of Figure 17.25. 

Assume a two-issue superscalar architecture. Repeat part (b) for an interleaved 
multithreading superscalar implementation, assuming no data dependencies. 
Note: There is no unique answer; you need to make assumptions about latency 
and priority. 

Repeat part c for a blocked multithreading superscalar implementation. 

Repeat for a four-issue SMT architecture. 


The following code segment needs to be executed 64 times for the evaluation of the 
vector arithmetic expression: D(I) = A(I) + B(I) X C(D) for0 = I = 63. 


Load R1,B(I) = /R1<Memory(a + I)/ 
Load R2,C(1) =/R2<—Memory(8 + I)/ 
Multiply R1,R2 /R1<—(R1) x (R2) 
Load R3,A(I)_ /R3<—Memory(y + I)/ 
Add R3, R1 /R3<—(R3) + (R1)/ 
Load D1, R3 /Memory(@ + I) <—(R3)/ 


~~ 


where R1, R2, and R3 are processor registers, and a, B, y, 6 are the starting main 
memory addresses of arrays B(I), C(I), A(I), and D(1), respectively. Assume four 
clock cycles for each Load or Store, two cycles for the Add, and eight cycles for the 
Multiplier on either a uniprocessor or a single processor in an SIMD machine. 


a. 


Calculate the total number of processor cycles needed to execute this code seg- 
ment repeatedly 64 times on a SISD uniprocessor computer sequentially, ignoring 
all other time delays. 


. Consider the use of an SIMD computer with 64 processing elements to execute 


the vector operations in six synchronized vector instructions over 64-component 
vector data and both driven by the same-speed clock. Calculate the total execu- 
tion time on the SIMD machine, ignoring instruction broadcast and other delays. 
What is the speedup gain of the SIMD computer over the SISD computer? 
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17.13. Produce a vectorized version of the following program: 


DO 201 =1,N 
B(I, 1) =0 
DO 10J =1,M 


A(I) = A(D) + B(L,J) x C(L,J) 
10 CONTINUE 

D()) = E() + A() 
20 CONTINUE 


17.14 An application program is executed on a nine-computer cluster. A benchmark pro- 
gram took time T on this cluster. Further, it was found that 25% of T was time in 
which the application was running simultaneously on all nine computers. The remain- 
ing time, the application had to run on a single computer. 


17.15 


17.16 


a. 


Calculate the effective speedup under the aforementioned condition as compared 
to executing the program on a single computer. Also calculate a, the percentage 
of code that has been parallelized (programmed or compiled so as to use the clus- 
ter mode) in the preceding program. 


. Suppose that we are able to effectively use 17 computers rather than 9 comput- 


ers on the parallelized portion of the code. Calculate the effective speedup that 
is achieved. 


The following FORTRAN program is to be executed on a computer, and a parallel 
version is to be executed on a 32-computer cluster. 


Li: DO 101 = 1, 1024 

L2: SUM(I) = 0 

L3: DO 20J = 1,1 

L4: 20 SUM(I) = SUM(I) + 1 


LS: 10 CONTINUE 


Suppose lines 2 and 4 each take two machine cycle times, including all processor and 
memory-access activities. Ignore the overhead caused by the software loop control 
statements (lines 1,3,5) and all other system overhead and resource conflicts. 


a. 


b. 


d. 


What is the total execution time (in machine cycle times) of the program on a sin- 
gle computer? 

Divide the I-loop iterations among the 32 computers as follows: Computer 1 exe- 
cutes the first 32 iterations (I = 1 to 32), processor 2 executes the next 32 itera- 
tions, and so on. What are the execution time and speedup factor compared with 
part (a)? (Note that the computational workload, dictated by the J-loop, is unbal- 
anced among the computers.) 

Explain how to modify the parallelizing to facilitate a balanced parallel execution of 
all the computational workload over 32 computers. By a balanced load is meant an 
equal number of additions assigned to each computer with respect to both loops. 
What is the minimum execution time resulting from the parallel execution on 32 
computers? What is the resulting speedup over a single computer? 


Consider the following two versions of a program to add two vectors: 


Li:  DO10I=1,N DOALL K = 1,M 

12: A(1) = B(I) + C() DO 101 = L(K-1) + 1,KL 

L3: 10 CONTINUE A(1) = B(I) + C(1) 

LA: SUM = 0 10 CONTINUE 

LS: DO 20J =1,N SUM(K) = 0 

L6: SUM = SUM + A(J) DO 20J =1,L 

L7: 20 CONTINUE SUM(K) = SUM(K) + A(L(K-1) + J) 
20 CONTINUE 


ENDALL 
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The program on the left executes on a uniprocessor. Suppose each line of code L2, 
L4, and L6 takes one processor clock cycle to execute. For simplicity, ignore the 
time required for the other lines of code. Initially all arrays are already loaded in 
main memory and the short program fragment is in the instruction cache. How 
many clock cycles are required to execute this program? 

The program on the right is written to execute on a multiprocessor with M proces- 
sors. We partition the looping operations into M sections with L = N/M elements 
per section. DOALL declares that all M sections are executed in parallel. The re- 
sult of this program is to produce M partial sums. Assume that k clock cycles are 
needed for each interprocessor communication operation via the shared memory 
and that therefore the addition of each partial sum requires k cycles. An /-level bi- 
nary adder tree can merge all the partial sums, where / = log, M. How many cy- 
cles are needed to produce the final sum? 

Suppose N = 2”° elements in the array and M = 256. What is the speedup 
achieved by using the multiprocessor? Assume k = 200. What percentage is this 
of the theoretical speedup of a factor of 256? 
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KEY POINTS 


® A multicore computer, or chip multiprocessor, combines two or more proces- 
sors on a single computer chip. 


® The use of ever more complex single-processor chips has reached a limit 
due to hardware performance issues, including limits in instruction-level 
parallelism and power limitations. 


® On the other hand, the multicore architecture poses challenges to software 
developers to exploit the capability for multithreading across multiple cores. 


® The main variables in a multicore organization are the number of proces- 
sors on the chip, the number of levels of cache memory, and the extent to 
which cache memory is shared. 

® Another organizational design decision in a multicore system is whether 


the individual cores will be superscalar or will implement simultaneous 
multithreading (SMT). 


A multicore computer, also known as a chip multiprocessor, combines two or more 
processors (called cores) on a single piece of silicon (called a die). Typically, each 
core consists of all of the components of an independent processor, such as registers, 
ALU, pipeline hardware, and control unit, plus L1 instruction and data caches. In 
addition to the multiple cores, contemporary multicore chips also include L2 cache 
and, in some cases, L3 cache. 

This chapter provides an overview of multicore systems. When begin with a look 
at the hardware performance factors that led to the development of multicore comput- 
ers and the software challenges of exploiting the power of a multicore system. Next, we 
look at multicore organization. Finally, we examine two examples of multicore prod- 
ucts, those of Intel and ARM. 


18.1 HARDWARE PERFORMANCE ISSUES 


As we discuss in Chapter 2, microprocessor systems have experienced a steady, ex- 
ponential increase in execution performance for decades. Figure 2.12 shows that this 
increase is due partly to refinements in the organization of the processor on the 
chip, and partly to the increase in the clock frequency. 


Increase in Parallelism 


The organizational changes in processor design have primarily been focused on in- 
creasing instruction-level parallelism, so that more work could be done in each 
clock cycle. These changes include, in chronological order (Figure 18.1): 


e Pipelining: Individual instructions are executed through a pipeline of stages so 
that while one instruction is executing in one stage of the pipeline, another in- 
struction is executing in another stage of the pipeline. 
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(c) Multicore 


Figure 18.1 Alternative Chip Organizations 


e Superscalar: Multiple pipelines are constructed by replicating execution re- 
sources. This enables parallel execution of instructions in parallel pipelines, so 
long as hazards are avoided. 


e Simultaneous multithreading (SMT): Register banks are replicated so that 
multiple threads can share the use of pipeline resources. 


For each of these innovations, designers have over the years attempted to in- 
crease the performance of the system by adding complexity. In the case of pipelin- 
ing, simple three-stage pipelines were replaced by pipelines with five stages, and 
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then many more stages, with some implementations having over a dozen stages. 
There is a practical limit to how far this trend can be taken, because with more 
stages, there is the need for more logic, more interconnections, and more control sig- 
nals. With superscalar organization, performance increases can be achieved by in- 
creasing the number of parallel pipelines. Again, there are diminishing returns as 
the number of pipelines increases. More logic is required to manage hazards and to 
stage instruction resources. Eventually, a single thread of execution reaches the 
point where hazards and resource dependencies prevent the full use of the multiple 
pipelines available. This same point of diminishing returns is reached with SMT, as 
the complexity of managing multiple threads over a set of pipelines limits the num- 
ber of threads and number of pipelines that can be effectively utilized. 

Figure 18.2, from [OLUKO5], is instructive in this context. The upper graph 
shows the exponential increase in Intel processor performance over the years.! 
The middle graph is calculated by combining Intel’s published SPEC CPU figures 
and processor clock frequencies to give a measure of the extent to which perfor- 
mance improvement is due to increased exploitation of instruction-level paral- 
lelism. There is a flat region in the late 1980s before parallelism was exploited 
extensively. This is followed by a steep rise as designers were able to increasingly 
exploit pipelining, superscalar techniques, and SMT. But, beginning about 2000, a 
new flat region of the curve appears, as the limits of effective exploitation of 
instruction-level parallelism are reached. 

There is a related set of problems dealing with the design and fabrication of 
the computer chip. The increase in complexity to deal with all of the logical issues 
related to very long pipelines, multiple superscalar pipelines, and multiple SMT 
register banks means that increasing amounts of the chip area is occupied with co- 
ordinating and signal transfer logic. This increases the difficulty of designing, fabri- 
cating, and debugging the chips. The increasingly difficult engineering challenge 
related to processor logic is one of the reasons that an increasing fraction of the 
processor chip is devoted to the simpler memory logic. Power issues, discussed 
next, provide another reason. 


Power Consumption 


To maintain the trend of higher performance as the number of transistors per chip 
rise, designers have resorted to more elaborate processor designs (pipelining, super- 
scalar, SMT) and to high clock frequencies. Unfortunately, power requirements 
have grown exponentially as chip density and clock frequency have risen. This is 
shown in the lowest graph in Figure 18.2. 

One way to control power density is to use more of the chip area for cache 
memory. Memory transistors are smaller and have a power density an order of mag- 
nitude lower than that of logic (see Figure 18.3a). As Figure 18.3b, from [BORK03], 
shows, the percentage of the chip area devoted to memory has grown to exceed 50% 
as the chip transistor density has increased. 


'The data are based on published SPEC CPU figures from Intel, normalized across varying suites. 
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Figure 18.3 Power and Memory Considerations 


Figure 18.4, from [BORK07], shows where the power consumption trend is 
leading. By 2015, we can expect to see microprocessor chips with about 100 billion 
transistors on a 300 mm’ die. Assuming about 50-60% of the chip area is devoted to 
memory, the chip will support cache memory of about 100 MB and leave over 1 bil- 
lion transistors available for logic. 

How to use all those logic transistors is a key design issue. As discussed ear- 
lier in this section, there are limits to the effective use of such techniques as su- 
perscalar and SMT. In general terms, the experience of recent decades has been 


1012 


19011 


1919 


Total 


9 
10 transistors 


Transistors 


108 


Logic 
transistors 


107 


10° 
2001 2003 2005 2007 2009 2011 2013 2015 2017 


Figure 18.4 Chip Utilization of Transistors 
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encapsulated in a rule of thumb known as Pollack’s rule [POLL99], which states 
that performance increase is roughly proportional to square root of increase in 
complexity. In other words, if you double the logic in a processor core, then it deliv- 
ers only 40% more performance. In principle, the use of multiple cores has the po- 
tential to provide near-linear performance improvement with the increase in the 
number of cores. 

Power considerations provide another motive for moving toward a multi- 
core organization. Because the chip has such a huge amount of cache memory, it 
becomes unlikely that any one thread of execution can effectively use all that 
memory. Even with SMT, you are multithreading in a relatively limited fashion 
and cannot therefore fully exploit a gigantic cache, whereas a number of rela- 
tively independent threads or processes has a greater opportunity to take full ad- 
vantage of the cache memory. 


18.2 SOFTWARE PERFORMANCE ISSUES 


A detailed examination of the software performance issues related to multicore or- 
ganization is beyond our scope. In this section, we first provide an overview of these 
issues, and then look at an example of an application designed to exploit multicore 
capabilities. 


Software on Multicore 


The potential performance benefits of a multicore organization depend on the abil- 
ity to effectively exploit the parallel resources available to the application. Let us 
focus first on a single application running on a multicore system. Recall from 
Chapter 2 that Amdahl’s law states that: 


time to execute program on a single processor 


epeesup time to execute program on N parallel processors 
1 
= ———+ (18.1) 
1-f)+— 
a-f\+s 


The law assumes a program in which a fraction (1 — f) of the execution time in- 
volves code that is inherently serial and a fraction f that involves code that is infi- 
nitely parallelizable with no scheduling overhead. 

This law appears to make the prospect of a multicore organization attractive. 
But as Figure 18.5a shows, even a small amount of serial code has a noticeable im- 
pact. If only 10% of the code is inherently serial (f = 0.9), the running the program 
on a multicore system with 8 processors yields a performance gain of only a factor of 
4.7. In addition, software typically incurs overhead as a result of communication and 
distribution of work to multiple processors and cache coherence overhead. This re- 
sults in a curve where performance peaks than then begins to degrade because of 
the increased burden of the overhead of using multiple processors. Figure 18.5b, 
from [MCDO05], is a representative example. 
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Figure 18.5 Performance Effect of Multiple Cores 


However, software engineers have been addressing this problem and there are 
numerous applications in which it is possible to effectively exploit a multicore sys- 
tem. [MCDO05] reports on a set of database applications, in which great attention 
was paid to reducing the serial fraction within hardware architectures, operating 
systems, middleware, and the database application software. Figure 18.6 shows the 
result. As this example shows, database management systems and database applica- 
tions are one area in which multicore systems can be used effectively. Many kinds of 
servers can also effectively use the parallel multicore organization, because servers 
typically handle numerous relatively independent transactions in parallel. 
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Figure 18.6 Scaling of Database Workloads on Multiple-Processor Hardware 


In addition to general-purpose server software, a number of classes of applica- 


tions benefit directly from the ability to scale throughput with the number of cores. 
[MCDO06] lists the following examples: 


e Multithreaded native applications: Multithreaded applications are character- 
ized by having a small number of highly threaded processes. Examples of 
threaded applications include Lotus Domino or Siebel CRM (Customer Rela- 
tionship Manager). 

e Multiprocess applications: Multiprocess applications are characterized by the 
presence of many single-threaded processes. Examples of multi-process appli- 
cations include the Oracle database, SAP, and PeopleSoft. 


e Java applications: Java applications embrace threading in a fundamental way. 
Not only does the Java language greatly facilitate multithreaded applications, 
but the Java Virtual Machine is a multithreaded process that provides schedul- 
ing and memory management for Java applications. Java applications that can 
benefit directly from multicore resources include application servers such as 
Sun’s Java Application Server, BEA’s Weblogic, IBM’s Websphere, and the 
open-source Tomcat application server. All applications that use a Java 2 Plat- 
form, Enterprise Edition (J2EE platform) application server can immediately 
benefit from multicore technology. 


e Multiinstance applications: Even if an individual application does not scale to 
take advantage of a large number of threads, it is still possible to gain from 
multicore architecture by running multiple instances of the application in par- 
allel. If multiple application instances require some degree of isolation, virtu- 
alization technology (for the hardware of the operating system) can be used to 
provide each of them with its own separate and secure environment. 
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Application Example: Valve Game Software 


Valve is an entertainment and technology company that has developed a number of 
popular games, as well as the Source engine, one of the most widely played game en- 
gines available. Source is an animation engine used by Valve for its games and li- 
censed for other game developers. 

In recent years, Valve has reprogrammed the Source engine software to use 
multithreading to exploit the power of multicore processor chips from Intel and 
AMD [REIM06]. The revised Source engine code provides more powerful support 
for Valve games such as Half Life 2. 

From Valve’s perspective, threading granularity options are defined as fol- 
lows [HARRO6]: 


e Coarse threading: Individual modules, called systems, are assigned to individual 
processors. In the Source engine case, this would mean putting rendering on one 
processor, AI (artificial intelligence) on another, physics on another, and so on. This 
is straightforward. In essence, each major module is single threaded and the princi- 
pal coordination involves synchronizing all the threads with a timeline thread. 


e Fine-grained threading: Many similar or identical tasks are spread across mul- 
tiple processors. For example, a loop that iterates over an array of data can be 
split up into a number of smaller parallel loops in individual threads that can 
be scheduled in parallel. 


e Hybrid threading: This involves the selective use of fine-grain threading for 
some systems and single threading for other systems. 


Valve found that through coarse threading, it could achieve up to twice the perfor- 
mance across two processors compared to executing on a single processor. But this per- 
formance gain could only be achieved with contrived cases. For real-world gameplay, the 
improvement was on the order of a factor of 1.2. Valve also found that effective use of 
fine-grain threading was difficult. The time per work unit can be variable, and managing 
the timeline of outcomes and consequences involved complex programming. 

Valve found that a hybrid threading approach was the most promising and 
would scale the best as multicore systems with eight or sixteen processors became 
available. Valve identified systems that operate very effectively being permanently 
assigned to a single processor. An example is sound mixing, which has little user in- 
teraction, is not constrained by the frame configuration of windows, and works on its 
own set of data. Other modules, such as scene rendering, can be organized into a 
number of threads so that the module can execute on a single processor but achieve 
greater performance as it is spread out over more and more processors. 

Figure 18.7 illustrates the thread structure for the rendering module. In this hi- 
erarchical structure, higher-level threads spawn lower-level threads as needed. The 
rendering module relies on a critical part of the Source engine, the world list, which 
is a database representation of the visual elements in the game’s world. The first 
task is to determine what are the areas of the world that need to be rendered. The 
next task is to determine what objects are in the scene as viewed from multiple an- 
gles. Then comes the processor-intensive work. The rendering module has to work 
out the rendering of each object from multiple points of view, such as the player’s 
view, the view of TV monitors, and the point of view of reflections in water. 
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Figure 18.7 Hybrid Threading for Rendering Module 


Some of the key elements of the threading strategy for the rendering module 
are listed in [LEONO7] and include the following: 


e Construct scene-rendering lists for multiple scenes in parallel (e.g., the world 
and its reflection in water). 
e Overlap graphics simulation. 
e Compute character bone transformations for all characters in all scenes in parallel. 
e Allow multiple threads to draw in parallel. 
The designers found that simply locking key databases, such as the world list, for 
a thread was too inefficient. Over 95% of the time, a thread is trying to read from a data 


set, and only 5% of the time at most is spent in writing to a data set. Thus, a concurrency 
mechanism known as the single-writer-multiple-readers model works effectively. 


18.3 MULTICORE ORGANIZATION 


Ata top level of description, the main variables in a multicore organization are as follows: 
e The number of core processors on the chip 
e The number of levels of cache memory 


e The amount of cache memory that is shared 


18.3 / MULTICORE ORGANIZATION 695 


CPU Core 1 |. - «| CPU Coren CPU Core1 |. . -| CPU Coren 
I I I I I I I I 
L1-D]| L1-I L1-D| L1-I L1-D]| L1-I L1-D]| L1-I 
I I I I 
L2 cache L2 cache 


L2 cache (vo) 


Main memory @ 


Main memory 


(b) Dedicated L2 cache 
(a) Dedicated L1 cache 
CPU Core 1 |* * *| CPU Coren CPU Core 1 |* * *| CPU Coren 
I I I I 
I I L1-D][L1-1 L1-D][L1-1 
L1-D]| L1-I L1-D]| L1-I I I I I 
I I T I L2 cache L2 cache 
L2 cache L3 cache 


Main memory (vo) Main memory (Vo) 


(c) Shared L2 cache (d ) Shared L3 cache 


Figure 18.8 Multicore Organization Alternatives 


Figure 18.8 shows four general organizations for multicore systems. Figure 18.8a is 
an organization found in some of the earlier multicore computer chips and is still seen in 
embedded chips. In this organization, the only on-chip cache is L1 cache, with each core 
having its own dedicated L1 cache. Almost invariably, the L1 cache is divided into in- 
struction and data caches. An example of this organization is the ARM11 MPCore. 

The organization of Figure 18.8b is also one in which there is no on-chip cache 
sharing. In this, there is enough area available on the chip to allow for L2 cache. An 
example of this organization is the AMD Opteron. Figure 18.8c shows a similar allo- 
cation of chip space to memory, but with the use of a shared L2 cache. The Intel 
Core Duo has this organization. Finally, as the amount of cache memory available 
on the chip continues to grow, performance considerations dictate splitting off a sep- 
arate, shared L3 cache, with dedicated L1 and L2 caches for each core processor. 
The Intel Core 17 is an example of this organization. 

The use of a shared L2 cache on the chip has several advantages over exclusive 
reliance on dedicated caches: 


1. Constructive interference can reduce overall miss rates. That is, if a thread on 
one core accesses a main memory location, this brings the frame containing 
the referenced location into the shared cache. If a thread on another core soon 
thereafter accesses the same memory block, the memory locations will already 
be available in the shared on-chip cache. 
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2. A related advantage is that data shared by multiple cores is not replicated at the 
shared cache level. 


3. With proper frame replacement algorithms, the amount of shared cache allocated to 
each core is dynamic, so that threads that have a less locality can employ more cache. 


4. Interprocessor communication is easy to implement, via shared memory locations. 


5. The use of a shared L2 cache confines the cache coherency problem to the L1 
cache level, which may provide some additional performance advantage. 


A potential advantage to having only dedicated L2 caches on the chip is that 
each core enjoys more rapid access to its private L2 cache. This is advantageous for 
threads that exhibit strong locality. 

As both the amount of memory available and the number of cores grow, the 
use of a shared L3 cache combined with either a shared L2 cache or dedicated per- 
core L2 caches seems likely to provide better performance than simply a massive 
shared L2 cache. 

Another organizational design decision in a multicore system is whether the 
individual cores will be superscalar or will implement simultaneous multithreading 
(SMT). For example, the Intel Core Duo uses superscalar cores, whereas the Intel 
Core i7 uses SMT cores. SMT has the effect of scaling up the number of hardware- 
level threads that the multicore system supports. Thus, a multicore system with four 
cores and SMT that supports four simultaneous threads in each core appears the 
same to the application level as a multicore system with 16 cores. As software is de- 
veloped to more fully exploit parallel resources, an SMT approach appears to be 
more attractive than a superscalar approach. 


18.4 INTEL x86 MULTICORE ORGANIZATION 


Intel has introduced a number of multicore products in recent years. In this section, 
we look at two examples: the Intel Core Duo and the Intel Core 17. 


Intel Core Duo 


The Intel Core Duo, introduced in 2006, implements two x86 superscalar processors 
with a shared L2 cache (Figure 18.8c). 

The general structure of the Intel Core Duo is shown in Figure 18.9. Let us 
consider the key elements starting from the top of the figure. As is common in mul- 
ticore systems, each core has its own dedicated L1 cache. In this case, each core has 
a 32-KB instruction cache and a 32-KB data cache. 

Each core has an independent thermal control unit. With the high transistor 
density of today’s chips, thermal management is a fundamental capability, especially 
for laptop and mobile systems. The Core Duo thermal control unit is designed to 
manage chip heat dissipation to maximize processor performance within thermal 
constraints. Thermal management also improves ergonomics with a cooler system 
and lower fan acoustic noise. In essence, the thermal management unit monitors dig- 
ital sensors for high-accuracy die temperature measurements. Each core can be de- 
fined as in independent thermal zone. The maximum temperature for each thermal 
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Figure 18.9 Intel Core Duo Block Diagram 


zone is reported separately via dedicated registers that can be polled by software. If 
the temperature in a core exceeds a threshold, the thermal control unit reduces the 
clock rate for that core to reduce heat generation. 

The next key element of the Core Duo organization is the Advanced Pro- 
grammable Interrupt Controller (APIC). The APIC performs a number of func- 
tions, including the following: 


1. The APIC can provide interprocessor interrupts, which allow any process to inter- 
rupt any other processor or set of processors. In the case of the Core Duo, a thread 
in one core can generate an interrupt, which is accepted by the local APIC, routed 
to the APIC of the other core, and communicated as an interrupt to the other core. 


2. The APIC accepts I/O interrupts and routes these to the appropriate core. 


3. Each APIC includes a timer, which can be set by the OS to generate an inter- 
rupt to the local core. 


The power management logic is responsible for reducing power consumption 
when possible, thus increasing battery life for mobile platforms, such as laptops. In 
essence, the power management logic monitors thermal conditions and CPU activ- 
ity and adjusts voltage levels and power consumption appropriately. It includes an 
advanced power-gating capability that allows for an ultra fine-grained logic control 
that turns on individual processor logic subsystems only if and when they are needed. 
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Additionally, many buses and arrays are split so that data required in some modes of 
operation can be put in a low power state when not needed. 

The Core Duo chip includes a shared 2-MB L2 cache. The cache logic allows 
for a dynamic allocation of cache space based on current core needs, so that one 
core can be assigned up to 100% of the L2 cache. The L2 cache includes logic to sup- 
port the MESI protocol for the attached L1 caches. The key point to consider is 
when a cache write is done at the L1 level. A cache line gets the M state when a 
processor writes to it; if the line is not in E or M-state prior to writing it, the cache 
sends a Read-For-Ownership (RFO) request that ensures that the line exists in the 
L1 cache and is in the I state in the other L1 cache. The Intel Core Duo extends this 
protocol to take into account the case when there are multiple Core Duo chips or- 
ganized as a symmetric multiprocessor (SMP) system. The L2 cache controller allow 
the system to distinguish between a situation in which data are shared by the two 
local cores, but not with the rest of the world, and a situation in which the data are 
shared by one or more caches on the die as well as by an agent on the external bus 
(can be another processor). When a core issues an RFO, if the line is shared only by 
the other cache within the local die, we can resolve the RFO internally very fast, 
without going to the external bus at all. Only if the line is shared with another agent 
on the external bus do we need to issue the RFO externally. 

The bus interface connects to the external bus, known as the Front Side Bus, 
which connects to main memory, I/O controllers, and other processor chips. 


Intel Core i7 


The Intel Core i7, introduced in November of 2008, implements four x86 SMT proces- 
sors, each with a dedicated L2 cache, and with a shared L3 cache (Figure 18.8d). 

The general structure of the Intel Core i7 is shown in Figure 18.10. Each core 
has its own dedicated L2 cache and the four cores share an 8-MB L3 cache. One 
mechanism Intel uses to make its caches more effective is prefetching, in which 
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Figure 18.10 Intel Core i7 Block Diagram 
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Table 18.1 Cache Latency (in clock cycles) 


CPU Clock Frequency L1 Cache L2 Cache L3 Cache 


Core 2 Quad 2.66 GHz 3 cycles 15 cycles a 


Core 17 2.66 GHz 4 cycles 11 cycles 39 cycles 


the hardware examines memory access patterns and attempts to fill the caches spec- 
ulatively with data that’s likely to be requested soon. It is interesting to compare the 
performance of this three-level on chip cache organization with a comparable two- 
level organization from Intel. Table 18.1 shows the cache access latency, in terms of 
clock cycles for two Intel multicore systems running at the same clock frequency. 
The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core 17 im- 
proves on L2 cache performance with the use of the dedicated L2 caches, and pro- 
vides a relatively high-speed access to the L3 cache. 

The Core i7 chip supports two forms of external communications to other 
chips. The DDR3 memory controller brings the memory controller for the DDR 
main memory” onto the chip. The interface supports three channels that are 8 bytes 
wide for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s. 
With the memory controller on the chip, the Front Side Bus is eliminated. 

The QuickPath Interconnect (QPI) is a cache-coherent, point-to-point link 
based electrical interconnect specification for Intel processors and chipsets. It en- 
ables high-speed communications among connected processor chips. The QPI link 
operates at 6.4 GT/s (transfers per second). At 16 bits per transfer, that adds up to 
12.8 GB/s, and since QPI links involve dedicated bidirectional pairs, the total band- 
width is 25.6 GB/s. 


18.5 ARM11 MPCore 


The ARM11 MPCore is a multicore product based on the ARM11 processor family. 
The ARM11 MPCore can be configured with up to four processors, each with its 
own L1 instruction and data caches, per chip. Table 18.1 lists the configurable op- 
tions for the system, including the default values. 

Figure 18.11 presents a block diagram of the ARM11 MPCore. The key ele- 
ments of the system are as follows: 


e Distributed interrupt controller (DIC): Handles interrupt detection and inter- 
rupt prioritization. The DIC distributes interrupts to individual processors. 


e Timer: Each CPU has its own private timer that can generate interrupts. 


Watchdog: Issues warning alerts in the event of software failures. If the watch- 
dog is enabled, it is set to a predetermined value and counts down to 0. It is pe- 
riodically reset. If the watchdog value reaches zero, an alert is issued. 


e CPU interface: Handles interrupt acknowledgement, interrupt masking, and 
interrupt completion acknowledgement. 


?The DDR synchronous RAM memory is discussed in Chapter 5. 
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Figure 18.11 ARM11 MPCore Processor Block Diagram 


e CPU: A single ARM11 processor. Individual CPUs are referred to as MP11 
CPUs. 


e Vector floating-point (VFP) unit: A coprocessor that implements floating- 
point operations in hardware. 


e Li cache: Each CPU has its own dedicated L1 data cache and L1 instruction cache. 


e Snoop control unit (SCU): Responsible for maintaining coherency among L1 
data caches. 


Interrupt Handling 


The Distributed Interrupt Controller (DIC) collates interrupts from a large number 
of sources. It provides 

e Masking of interrupts 

e Prioritization of the interrupts 

e Distribution of the interrupts to the target MP11 CPUs 
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e Tracking the status of interrupts 


e Generation of interrupts by software 


The DIC isa single functional unit that is placed in the system alongside MP11 
CPUs. This enables the number of interrupts supported in the system to be indepen- 
dent of the MP11 CPU design. The DIC is memory mapped; that is, control registers 
for the DIC are defined relative to a main memory base address. The DIC is ac- 
cessed by the MP11 CPUs using a private interface through the SCU. 

The DIC is designed to satisfy two functional requirements: 


e Provide a means of routing an interrupt request to a single CPU or multiple 
CPUs, as required. 


e Provide a means of interprocessor communication so that a thread on one 
CPU can cause activity by a thread on another CPU. 


As an example that makes use of both requirements, consider a multithreaded 
application that has threads running on multiple processors. Suppose the applica- 
tion allocates some virtual memory. To maintain consistency, the operating system 
must update memory translation tables on all processors. The OS could update the 
tables on the processor where the virtual memory allocation took place, and then 
issue an interrupt to all the other processors running this application. The other 
processors could then use this interrupt’s ID to determine that they need to update 
their memory translation tables. 

The DIC can route an interrupt to one or more CPUs in the following three ways: 


e An interrupt can be directed to a specific processor only. 


e An interrupt can be directed to a defined group of processors. The MPCore 
views the first processor to accept the interrupt, typically the least loaded, as 
being best positioned to handle the interrupt. 


e An interrupt can be directed to all processors. 


From the point of view of software running on a particular CPU, the OS can gen- 
erate an interrupt to all but self, to self, or to specific other CPUs. For communication 
between threads running on different CPUs, the interrupt mechanism is typically com- 
bined with shared memory for message passing. Thus, when a thread is interrupted by 
an interprocessor communication interrupt, it reads from the appropriate block of 
shared memory to retrieve a message from the thread that triggered the interrupt. A 
total of 16 interrupt IDs per CPU are available for interprocessor communication. 

From the point of view of an MP11 CPU, an interrupt can be 


e Inactive: An Inactive interrupt is one that is nonasserted, or which in a multi- 
processing environment has been completely processed by that CPU but can 
still be either Pending or Active in some of the CPUs to which it is targeted, 
and so might not have been cleared at the interrupt source. 

e Pending: A Pending interrupt is one that has been asserted, and for which pro- 
cessing has not started on that CPU. 

e Active: An Active interrupt is one that has been started on that CPU, but pro- 
cessing is not complete. An Active interrupt can be pre-empted when a new in- 
terrupt of higher priority interrupts MP11 CPU interrupt processing. 
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Interrupts come from the following sources: 


e Interprocessor interrupts (IPIs): Each CPU has private interrupts, ID0-ID15, 
that can only be triggered by software. The priority of an IPI depends on the 
receiving CPU, not the sending CPU. 


e Private timer and/or watchdog interrupts: These use interrupt IDs 29 and 30. 


e Legacy FIQ line: In legacy IRQ mode, the legacy FIQ pin, on a per CPU basis, 
bypasses the Interrupt Distributor logic and directly drives interrupt requests 
into the CPU. 


e Hardware interrupts: Hardware interrupts are triggered by programmable 
events on associated interrupt input lines. CPUs can support up to 224 inter- 
rupt input lines. Hardware interrupts start at ID32. 


Figure 18.12 is a block diagram of the DIC. The DIC is configurable to support 
between 0 and 255 hardware interrupt inputs. The DIC maintains a list of interrupts, 
showing their priority and status. The Interrupt Distributor transmits to each CPU 
Interface the highest Pending interrupt for that interface. It receives back the infor- 
mation that the interrupt has been acknowledged, and can then change the status of 
the corresponding interrupt. The CPU Interface also transmits End of Interrupt 
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Figure 18.12 Interrupt Distributor Block Diagram 
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Table 18.2. ARM11MPCore Configurable Options 


Feature Range of Options Default Value 


Processors 1to4 


Instruction cache size per processor 16 KB, 32 KB, or 64 KB 
Data cache size per processor 16 KB, 32 KB, or 64 KB 
Master ports lor2 


Width of interrupt bus 0 to 224 by increments of 32 pins 32 pins 


Vector floating point (VFP) Included or not Included 
coprocessor per processor 


Information (EOJ), which enables the Interrupt Distributor to update the status of 
this interrupt from Active to Inactive. 


Cache Coherency 


The MPCore’s Snoop Control Unit (SCU) is designed to resolve most of the tradi- 
tional bottlenecks related to access to shared data and the scalability limitation in- 
troduced by coherence traffic. 

The L1 cache coherency scheme is based on the MESI protocol described in 
Chapter 17. The SCU monitors operations shared data to optimize MESI state mi- 
gration. The SCU introduces three types of optimization: direct data intervention, 
duplicated tag RAMs, and migratory lines. 

Direct data intervention (DDI) enables copying clean data from one CPU L1 
data cache to another CPU L1 data cache without accessing external memory. This 
reduces read after read activity from the Level 1 cache to the Level 2 cache. Thus, a 
local L1 cache miss is resolved in a remote L1 cache rather than from access to the 
shared L2 cache. 

Recall that main memory location of each line within a cache is identified by a 
tag for that line. The tags can be implemented as a separate block of RAM of the same 
length as the number of lines in the cache. In the SCU, duplicated tag RAMs are du- 
plicated versions of L1 tag RAMs used by the SCU to check for data availability be- 
fore sending coherency commands to the relevant CPUs. Coherency commands are 
sent only to CPUs that must update their coherent data cache. This reduces the power 
consumption and performance impact from snooping into and manipulating each 
processor’s cache on each memory update. Having tag data available locally lets the 
SCU limit cache manipulations to processors that have cache lines in common. 

The migratory lines feature enables moving dirty data from one CPU to another 
without writing to L2 and reading the data back in from external memory. The opera- 
tion can be described as follows. In a typical MESI protocol, one processor has a mod- 
ified line and another processor attempts to read that line, the following actions occur: 


1. The line contents are transferred from the modified line to the processor that 
initiated the read. 
2. The line contents are read back to main memory. 


3. The line is put in the shared state in both caches. 
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The MPCore SCU handles this situation differently. The SCU monitors the 
system for a migratory line. If one processor has a modified line, and another 
processor reads then writes to it, the SCU assumes such a location will experience 
this same operation in the future. As this operation starts again, the SCU will auto- 
matically move the cache line directly to an invalid state rather than expending en- 
ergy moving it first into the shared state. This optimization also causes the processor 
to transfer the cache line directly to the other processor without intervening exter- 
nal memory operations. 


18.6 RECOMMENDED READING AND WEB SITE 


Two books that provide good coverage of the issues in this chapter are [OLUKO7] and 
[JERRO5]. [GOCH06] and [MEND06] describe the Intel Core Duo. [FOGO08b] provides a 
detailed description of the Core Duo pipeline architecture. 

[ARMO8b] provides thorough coverage of the ARM Cortex-A8 pipeline. [HIRAO7] 
and [GOOD05] are good overview articles. 
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18.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


Amdahl’s law multicore superscalar 
chip multiprocessor simultaneous multithreading (SMT) 


Review Questions 


18.1 Summarize the differences among simple instruction pipelining, superscalar, and si- 
multaneous multithreading. 


18.2 Give several reasons for the choice by designers to move to a multicore organization 
rather than increase parallelism within a single processor. 


18.3. Why is there a trend toward given an increasing fraction of chip area to cache memory? 


18.4 List some examples of applications that benefit directly from the ability to scale 
throughput with the number of cores. 


18.5 Ata top level, what are the main design variables in a multicore organization? 


18.6 List some advantages of a shared L2 cache among cores compared to separate dedi- 
cated L2 caches for each core. 


Problems 


18.1 Consider the following problem. A designer has available a chip and decided what 
fraction of the chip will be devoted to cache memory (L1, L2, L3). The remainder of 
the chip can be devoted to a single complex superscalar and/or SMT core or multiple 
somewhat simpler cores. Define the following parameters: 


n = maximum number of cores that can be contained on the chip 


k = actual number of cores implemented (1 = k = n, where r = n/k is an integer) 


perf(r) = sequential performance gain by using the resources equivalent to r cores 
to form a single processor, where perf(1) = 1. 


f = fraction of software that is parallelizable across multiple cores. 


Thus, if we construct a chip with 1 cores, we expect each core to provide sequential per- 
formance of 1 and for the n cores to be able to exploit parallelism up to a degree of n par- 
allel threads. Similarly, if the chip has k cores, then each core should exhibit a performance 
of perf(r) and the chip is able to exploit parallelism up to a degree of k parallel threads. 
We can modify Amdhal’s law (Equation 18.1) to reflect this situation as follows: 


1 
tae a {Xr 
perf(r) — perf(r) Xn 


a. Justify this modification of Amdahl’s law. 

b. Using Pollack’s rule, we set perf(r) = Vr. Let n = 16. We want to plot speedup 
as a function of r for f = 0.5; f = 0.9; f = 0.975; f = 0.99; f = 0.999. The results 
are available in a document at this book’s Web site (multicore-performance.pdf). 
What conclusions can you draw? 

c. Repeat part (b) for n = 256. 


Speedup = 
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18.2 


The technical reference manual for the ARM11 MPCore says that the Distributed In- 
terrupt Controller is memory mapped. That is, the core processors use memory 
mapped I/O to communicate with the DIC. Recall from Chapter 7 that with memory- 
mapped I/O, there is a single address space for memory locations and I/O devices. The 
processor treats the status and data registers of I/O modules as memory locations and 
uses the same machine instructions to access both memory and I/O devices. Based on 
this information, what path through the block diagram of Figure 18.11 is used for the 
core processors to communicate with the DIC? 
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Many instructors believe that research or implementation projects are crucial to the clear 
understanding of the concepts of computer organization and architecture. Without pro- 
jects, it may be difficult for students to grasp some of the basic concepts and interactions 
among components. Projects reinforce the concepts introduced in the book, give students 
a greater appreciation of the inner workings of processors and computer systems, and can 
motivate students and give them confidence that they have mastered the material. 


In this text, I have tried to present the concepts of computer organization and 


architecture as clearly as possible and have provided numerous homework prob- 
lems to reinforce those concepts. Many instructors will wish to supplement this ma- 
terial with projects. This appendix provides some guidance in that regard and 
describes support material available in the instructor’s manual. The support mater- 
ial covers six types of projects and other student exercises: 


Interactive simulations 
Research projects 
Simulation projects 
Assembly language projects 
Reading/report assignments 
Writing assignments 

Test bank 


A.1 INTERACTIVE SIMULATIONS 


New to this edition is the incorporation of interactive simulations. These simulations 
provide a powerful tool for understanding the complex design features of a modern 
computer system. Today’s students want to be able to visualize the various complex com- 
puter systems mechanisms on their own computer screen. A total of 20 simulations are 
used to illustrate key functions and algorithms in computer organization and architec- 
ture design. Table A.1 lists the simulations by chapter. At the relevant point in the book, 
an icon indicates that a relevant interactive simulation is available online for student use. 


Because the simulations enable the user to set initial conditions, they can serve 


as the basis for student assignments. The Instructor’s Resource Center (IRC) for this 
book includes a set of assignments, one set for each of the interactive simulations. 
Each assignment includes a several specific problems that can be assigned to students. 


The interactive simulations were developed under the direction of Professor 


Israel Koren, at the University of Massachusetts Department of Electrical and 
Computer Engineering. Aswin Sreedhar of the University of Massachusetts devel- 
oped the interactive simulation assignments. 


A.2 RESEARCH PROJECTS 


An effective way of reinforcing basic concepts from the course and for teaching stu- 
dents research skills is to assign a research project. Such a project could involve a lit- 
erature search as well as a Web search of vendor products, research lab activities, and 
standardization efforts. Projects could be assigned to teams or, for smaller projects, to 


Table A.1 


Computer Organization and Architecture — Interactive Simulations by Chapter 


Chapter 4— Cache Memory 


Cache Simulator 


Emulates small sized caches based on a user-input cache model and 
displays the cache contents at the end of the simulation cycle based 
on an input sequence which is entered by the user, or randomly 
generated if so selected. 


Cache Time Analysis 


Demonstrates Average Memory Access Time analysis for the cache 
parameters you specify. 


Multitask Cache Demonstrator 


Models cache on a system that supports multitasking. 


Selective Victim Cache Simulator 


Compares three different cache policies. 


Chapter 5— Internal Memory 


Interleaved Memory Simulator 


Demonstrates the effect of interleaving memory. 


Chapter 6— External Memory 


RAID 


Determine storage efficiency and reliability. 


Chapter 7— Input/Output 


1/O System Design Tool 


Evaluates comparative cost and performance of different I/O 
systems. 


Chapter 8— OS Support 


Page Replacement Algorithms 


Compares LRU, FIFO, and Optimal. 


More Page Replacement Algorithms 


Compares a number of policies. 


Chapter 12—CPU Structure and Function 


Reservation Table Analyzer 


Evaluates reservation tables. which are a way of representing the 
task flow pattern of a pipelined system. 


Branch Prediction 


Branch Target Buffer 


Demonstrates three different branch prediction schemes. 


Combined branch predictor/branch target buffer simulator. 


Chapter 13— Reduced Instruction Set Computers 


MIPS 5-Stage Pipeline 


Simulates the pipeline. 


Loop Unrolling 


Simulates the loop unrolling software technique for exploiting 
instruction-level parallelism. 


Chapter 14—Instruction-Level Parallelism and Superscalar Processors 


Pipeline with Static vs. Dynamic 
Scheduling 


A more complex simulation of the MIPS pipeline. 


Reorder Buffer Simulator 


Simulates instruction reordering in a RISC pipeline. 


Scoreboarding Technique for Dynamic 
Scheduling 


Simulation of an instruction scheduling technique used in a number 
of processors. 


Tomasulo’s Algorithm 


Simulation of another instruction scheduling technique. 


Alternative Simulation of Tomasulo’s 
Algorithm 


Another simulation of Tomasulo’s algorithm. 


Chapter 17— Parallel Processing 


Vector Processor Simulation 


Demonstrates execution of vector processing instructions. 


individuals. In any case, it is best to require some sort of project proposal early in the 
term, giving the instructor time to evaluate the proposal for appropriate topic and 
appropriate level of effort. Student handouts for research projects should include 

e A format for the proposal 

e A format for the final report 
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e A schedule with intermediate and final deadlines 
e A list of possible project topics 


The students can select one of the listed topics or devise their own comparable 
project. The IRC includes a suggested format for the proposal and final report as 
well as a list of possible research topics. 


A.3 SIMULATION PROJECTS 


An excellent way to obtain a grasp of the internal operation of a processor and to 
study and appreciate some of the design trade-offs and performance implications is 
by simulating key elements of the processor. Two useful tools that are useful for this 
purpose are SimpleScalar and SMPCache. 

Compared with actual hardware implementation, simulation provides two ad- 
vantages for both research and educational use: 


e With simulation, it is easy to modify various elements of an organization, to 
vary the performance characteristics of various components, and then to ana- 
lyze the effects of such modifications. 


e Simulation provides for detailed performance statistics collection, which can 
be used to understand performance trade-offs. 


SimpleScalar 


SimpleScalar [BURG97, MANJ01a, MANJO1b] is a set of tools that can be used 
to simulate real programs on a range of modern processors and systems. The tool 
set includes compiler, assembler, linker, and simulation and visualization tools. 
SimpleScalar provides processor simulators that range from an extremely fast 
functional simulator to a detailed out-of-order issue, superscalar processor simu- 
lator that supports nonblocking caches and speculative execution. The instruction 
set architecture and organizational parameters may be modified to create a vari- 
ety of experiments. 

The IRC for this book includes a concise introduction to SimpleScalar for 
students, with instructions on how to load and get started with SimpleScalar. The 
manual also includes some suggested project assignments. 

SimpleScalar is a portable software package the runs on most UNIX plat- 
forms. The SimpleScalar software can be downloaded from the SimpleScalar Web 
site. It is available at no cost for noncommercial use. 


SMPCache 


SMPCache is a trace-driven simulator for the analysis and teaching of cache memory 
systems on symmetric multiprocessors [RODRO1]. The simulation is based on a model 
built according to the architectural basic principles of these systems. The simulator has 
a full graphic and friendly interface. Some of the parameters that they can be studied 
with the simulator are: program locality; influence of the number of processors, cache 
coherence protocols, schemes for bus arbitration, mapping, replacement policies, cache 
size (blocks in cache), number of cache sets (for set associative caches), number of 
words by block (memory block size). 
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The IRC for this book includes a concise introduction to SMPCache for stu- 
dents, with instructions on how to load and get started with SMPCache. The manual 
also includes some suggested project assignments. 

SMPCache is a portable software package the runs on PC systems with Win- 
dows. The SMPCache software can be downloaded from the SMPCache Web site. It 
is available at no cost for noncommercial use. 


A.4 ASSEMBLY LANGUAGE PROJECTS 


Assembly language programming is often used to teach students low-level hardware 
components and computer architecture basics. CodeBlue is a simplified assembly lan- 
guage program developed at the U.S. Air Force Academy. The goal of the work was to 
develop and teach assembly language concepts using a visual simulator that students 
can learn in a single class. The developers also wanted students to find the language 
motivational and fun to use. The CodeBlue language is much simpler than most simpli- 
fied architecture instruction sets such as the $C123. Still it allows students to develop 
interesting assembly level programs that compete in tournaments, similar to the far 
more complex SPIMbot simulator. Most important, through CodeBlue programming, 
students learn fundamental computer architecture concepts such as instructions and 
data co-residence in memory, control structure implementation, and addressing modes. 

To provide a basis for projects, the developers have built a visual development 
environment that allows students to create a program, see its representation in 
memory, step through the program’s execution, and simulate a battle of competing 
programs in a visual memory environment. 

Projects can be built around the concept of a Core War tournament. Core War 
is a programming game introduced to the public in the early 1980s, which was popu- 
lar for a period of 15 years or so. Core War has four main components: a memory 
array of 8000 addresses, a simplified assembly language Redcode, an executive pro- 
gram called MARS (an acronym for Memory Array Redcode Simulator) and the 
set of contending battle programs. Two battle programs are entered into the mem- 
ory array at randomly chosen positions; neither program knows where the other one 
is. MARS executes the programs in a simple version of time-sharing. The two pro- 
grams take turns: a single instruction of the first program is executed, then a single 
instruction of the second, and so on. What a battle program does during the execu- 
tion cycles allotted to it is entirely up to the programmer. The aim is to destroy the 
other program by ruining its instructions. The CodeBlue environment substitutes 
CodeBlue for Redcode and provides its own interactive execution interface. 

The IRC includes the CodeBlue environment, a user’s manual for students, 
other supporting material, and suggested assignments. 


A.5 READING/REPORT ASSIGNMENTS 


Another excellent way to reinforce concepts from the course and to give students 
research experience is to assign papers from the literature to be read and analyzed. 
The IRC site includes a suggested list of papers to be assigned, organized by chapter. 
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The IRC provides a copy of each of the papers. The IRC also includes a suggested 
assignment wording. 


A.6 WRITING ASSIGNMENTS 


Writing assignments can have a powerful multiplier effect in the learning process in 
a technical discipline such as data communications and networking. Adherents of 
the Writing Across the Curriculum (WAC) movement (http://wac.colostate.edu/) re- 
port substantial benefits of writing assignments in facilitating learning. Writing as- 
signments lead to more detailed and complete thinking about a particular topic. In 
addition, writing assignments help to overcome the tendency of students to pursue a 
subject with a minimum of personal engagement, just learning facts and problem- 
solving techniques without obtaining a deep understanding of the subject matter. 

The IRC contains a number of suggested writing assignments, organized by 
chapter. Instructors may ultimately find that this is the most important part of their 
approach to teaching the material. I would greatly appreciate any feedback on this 
area and any suggestions for additional writing assignments. 


A.7 TEST BANK 


A test bank for the book is available at the IRC site for this book. For each chapter, 
the test bank includes true/false, multiple choice, and fill-in-the-blank questions. The 
test bank is an effective way to assess student comprehension of the material. 
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KEY POINTS 


® Anassembly language is a symbolic representation of the machine language 
of a specific processor, augmented by additional types of statements that 
facilitate program writing and that provide instructions to the assembler. 

® Anassembler is a program that translates assembly language into machine 
code. 

® The first step in the creation of an active process is to load a program into 
main memory and create a process image. 


® A linker is used to resolve any references between loaded modules. 


The topic of assembly language was briefly introduced in Chapter 11. This appendix 
provides more detail and also covers a number of related topics. There are a number of 
reasons why it is worthwhile to study assembly language programming (as compared 
with programming in a higher-level language), including the following: 


1. It clarifies the execution of instructions. 


. It shows how data is represented in memory. 


w NM 


. It shows how a program interacts with the operating system, processor, and the 
1/O system. 


4. It clarifies how a program accesses external devices. 


5. Understanding assembly language programmers makes students better high- 
level language (HLL) programmers, by giving them a better idea of the target 
language that the HLL must be translated into. 


We begin this chapter with a study of the basic elements of an assembly lan- 
guage, using the x86 architecture for our examples.! Next, we look at the operation 
of the assembler. This is followed by a discussion of linkers and loaders. 

Table B.1 defines some of the key terms used in this appendix. 


B.1 ASSEMBLY LANGUAGE 


Assembly language is a programming language that is one step away from machine 
language. Typically, each assembly language instruction is translated into one ma- 
chine instruction by the assembler. Assembly language is hardware dependent, with 
a different assembly language for each type of processor. In particular, assembly 
language instructions can make reference to specific registers in the processor, in- 
clude all of the opcodes of the processor, and reflect the bit length of the various 


'There are a number of assemblers for the x86 architecture. Our examples use NASM (Netwide 
Assembler), an open source assembler. A copy of the NASM manual is at this book’s Web site. 
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Table B.1 Key Terms for this Appendix 


Assembler 


A program that translates assembly language into machine code. 


Assembly Language 

A symbolic representation of the machine language of a specific processor, augmented by additional 
types of statements that facilitate program writing and that provide instructions to the assembler. 
Compiler 


A program that converts another program from some source language (or programming language) 
to machine language (object code). Some compilers output assembly language which is then con- 
verted to machine language by a separate assembler. A compiler is distinguished from an assembler 
by the fact that each input statement does not, in general, correspond to a single machine instruction 
or fixed sequence of instructions. A compiler may support such features as automatic allocation of 
variables, arbitrary arithmetic expressions, control structures such as FOR and WHILE loops, vari- 
able scope, input/ouput operations, higher-order functions and portability of source code. 


Executable Code 


The machine code generated by a source code language processor such as an assembler or compiler. 
This is software in a form that can be run in the computer. 


Instruction Set 


The collection of all possible instructions for a particular computer; that is, the collection of machine 
language instructions that a particular processor understands. 


Linker 


A utility program that combines one or more files containing object code from separately compiled 
program modules into a single file containing loadable or executable code. 


Loader 


A program routine that copies an executable program into memory for execution. 


Machine Language, or Machine Code 


The binary representation of a computer program which is actually read and interpreted by the 
computer. A program in machine code consists of a sequence of machine instructions (possibly 
interspersed with data). Instructions are binary strings which may be either all the same size 
(e.g., one 32-bit word for many modern RISC microprocessors) or of different sizes. 


Object Code 


The machine language representation of programming source code. Object code is created by a 
compiler or assembler and is then turned into executable code by the linker. 


registers of the processor and operands of the machine language. An assembly lan- 
guage programmer must therefore understand the computer’s architecture. 

Programmers rarely use assembly language for applications or even systems 
programs. HLLs provide an expressive power and conciseness that greatly eases the 
programmer’s tasks. The disadvantages of using an assembly language rather than 
an HLL include the following [FOG08a]: 


1. Development time. Writing code in assembly language takes much longer 
than writing in a high-level language. 


2. Reliability and security. It is easy to make errors in assembly code. The assembler 
is not checking if the calling conventions and register save conventions are 
obeyed. Nobody is checking for you if the number of PUSH and POP instructions 
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mn 


N 


is the same in all possible branches and paths. There are so many possibilities for 
hidden errors in assembly code that it affects the reliability and security of the pro- 
ject unless you have a very systematic approach to testing and verifying. 


. Debugging and verifying. Assembly code is more difficult to debug and verify 


because there are more possibilities for errors than in high-level code. 


. Maintainability. Assembly code is more difficult to modify and maintain be- 


cause the language allows unstructured spaghetti code and all kinds of tricks 
that are difficult for others to understand. Thorough documentation and a 
consistent programming style are needed. 


. Portability. Assembly code is platform-specific. Porting to a different platform 


is difficult. 


. System code can use intrinsic functions instead of assembly. The best modern 


C++ compilers have intrinsic functions for accessing system control registers 
and other system instructions. Assembly code is no longer needed for device 
drivers and other system code when intrinsic functions are available. 


. Application code can use intrinsic functions or vector classes instead of assembly. 


The best modern C++ compilers have intrinsic functions for vector operations 
and other special instructions that previously required assembly programming. 


. Compilers have been improved a lot in recent years. The best compilers are 


now quite good. It takes a lot of expertise and experience to optimize better 
than the best C++ compiler. 


Yet there are still some advantages to the occasional use of assembly language, 


including the following [FOGO8a]: 


1. 


Debugging and verifying. Looking at compiler-generated assembly code or 
the disassembly window in a debugger is useful for finding errors and for 
checking how well a compiler optimizes a particular piece of code. 


. Making compilers. Understanding assembly coding techniques is necessary for 


making compilers, debuggers and other development tools. 


. Embedded systems. Small embedded systems have fewer resources than PCs 


and mainframes. Assembly programming can be necessary for optimizing code 
for speed or size in small embedded systems. 


. Hardware drivers and system code. Accessing hardware, system control regis- 


ters etc. may sometimes be difficult or impossible with high level code. 


. Accessing instructions that are not accessible from high-level language. 


Certain assembly instructions have no high-level language equivalent. 


. Self-modifying code. Self-modifying code is generally not profitable because it 


interferes with efficient code caching. It may, however, be advantageous, for 
example, to include a small compiler in math programs where a user-defined 
function has to be calculated many times. 


. Optimizing code for size. Storage space and memory is so cheap nowadays that it 


is not worth the effort to use assembly language for reducing code size. However, 
cache size is still such a critical resource that it may be useful in some cases to op- 
timize a critical piece of code for size in order to make it fit into the code cache. 
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8. Optimizing code for speed. Modern C++ compilers generally optimize code 
quite well in most cases. But there are still cases where compilers perform 
poorly and where dramatic increases in speed can be achieved by careful as- 
sembly programming. 


9. Function libraries. The total benefit of optimizing code is higher in function li- 
braries that are used by many programmers. 


10. Making function libraries compatible with multiple compilers and operating 
systems. It is possible to make library functions with multiple entries that are 
compatible with different compilers and different operating systems. This re- 
quires assembly programming. 


The terms assembly language and machine language are sometimes, erro- 
neously, used synonymously. Machine language consists of instructions directly exe- 
cutable by the processor. Each machine language instruction is a binary string 
containing an opcode, operand references, and perhaps other bits related to execu- 
tion, such as flags. For convenience, instead of writing an instruction as a bit string, it 
can be written symbolically, with names for opcodes and registers. An assembly lan- 
guage makes much greater use of symbolic names, including assigning names to spe- 
cific main memory locations and specific instruction locations. Assembly language 
also includes statements that are not directly executable but serve as instructions to 
the assembler that produces machine code from an assembly language program. 


Assembly Language Elements 


A statement in a typical assembly language has the form shown in Figure B.1. It con- 
sists of four elements: label, mnemonic, operand, and comment. 


LABEL If a label is present, the assembler defines the label as equivalent to the address 

into which the first byte of the object code generated for that instruction will be loaded. 

The programmer may subsequently use the label as an address or as data in another in- 

struction’s address field. The assembler replaces the label with the assigned value when 

creating an object program. Labels are most frequently used in branch instructions. 
As an example, here is a program fragment: 


L2: SUB EAX, EDX ; subtract contents of register EDX from 
; contents of EAX and store result in EAX 


JG L2 ; jump to L2 if result of subtraction is 
; positive 


Label Mnemonic Operand(s) ;comment 
yo XR a a a) Ra ~~  f R a 7) 
Optional Opcode name Zero or more Optional 
or 
directive name 
or 
macro name 


Figure B.1 Assembly-Language Statement Structure 
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The program will continue to loop back to location L2 until the result is zero 
or negative. Thus, when the jg instruction is executed, if the result is positive, the 
processor places the address equivalent to the label L2 in the program counter. 

Reasons for using a label include the following; 


1. A label makes a program location easier to find and remember. 


2. The label can easily be moved to correct a program. The assembler will auto- 
matically change the address in all instructions that use the label when the pro- 
gram is reassembled. 


3. The programmer does not have to calculate relative or absolute memory 
addresses, but just uses labels as needed. 


MNEMONIC The mnemonic is the name of the operation or function of the assem- 
bly language statement. As discussed subsequently, a statement can correspond to a 
machine instruction, an assembler directive, or a macro. In the case of a machine 
instruction, a mnemonic is the symbolic name associated with a particular opcode. 

Table 10.8 lists the mnemonic, or instruction name, of many of the x86 instruc- 
tions. Appendix A of [CART06] lists the x86 instructions, together with the operands 
for each and the effect of the instruction on the condition codes. Appendix B of the 
NASM manual provides a more detailed description of each x86 instruction. Both 
documents are available at this book’s Web site. 


OPERAND(S) An assembly language statement includes zero or more operands. Each 
operand identifies an immediate value, a register value, or a memory location. Typically, 
the assembly language provides conventions for distinguishing among the three types 
of operand references, as well as conventions for indicating addressing mode. 

For the x86 architecture, an assembly language statement may refer to a regis- 
ter operand by name. Figure B.2 illustrates the general-purpose x86 registers, with 
their symbolic name and their bit encoding. The assembler will translate the sym- 
bolic name into the binary identifier for the register. 

As discussed in Section 11.2, the x86 architecture has a rich set of addressing 
modes, each of which must be expressed symbolically in the assembly language. Here 
we cite a few of the common examples. For register addressing, the name of the regis- 
ter is used in the instruction. For example, MOV ECX, EBX copies the contents of reg- 
ister EBX into register ECX. Immediate addressing indicates that the value is 
encoded in the instruction. For example, MOV EAX, 100H copies the hexadecimal 
value 100 into register EAX. The immediate value can be expressed as a binary num- 
ber with the suffix B or a decimal number with no suffix. Thus, equivalent statements 
to the preceding one are MOV EAX, 100000000B and MOV EAX, 256. Direct 
addressing refers to a memory location and is expressed as a displacement from the 
DS segment register. This is best explained by example. Assume that the 16-bit data 
segment register DS contains the value 1000H. Then the following sequence occurs: 


T, 


MOV AX, 1234H 
MOV [3518H], AX 


First the 16-bit register AX is initialized to 1234H. Then, in line two, the 
contents of AX are moved to the logical address DS:3518H. This address is formed 
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General-purpose registers 


31 0 16-bit 32-bit 
AH AL AX EAX (000) 
BH BL BX EBX (011) 
CH CL CX ECX (001) 
DH DL DX _ = EDX (010) 
ESI (110) 
EDI (111) 
EBP (101) 
ESP (100) 


Segment registers 


15 0 


CS 


DS 


SS 


ES 


FS 


GS 


Figure B.2_ Intel x86 Program Execution Registers 


by shifting the contents of DS left 4 bits and adding 3518H to form the 32-bit logi- 
cal address 13518H. 


COMMENT All assembly languages allow the placement of comments in the pro- 
gram. A comment can either occur at the right-hand end of an assembly statement 
or can occupy an entire text line. In either case, the comment begins with a special 
character that signals to the assembler that the rest of the line is a comment and is to 
be ignored by the assembler. Typically, assembly languages for the x86 architecture 
use a semicolon (;) for the special character. 


Type of Assembly Language Statements 


Assembly language statements are one of four types: instruction, directive, macro 
definition, and comment. A comment statement is simply a statement that consists 
entirely of a comment. The remaining types are briefly described in this section. 


INSTRUCTIONS The bulk of the noncomment statements in an assembly language 
program are symbolic representations of machine language instructions. Almost in- 
variably, there is a one-to-one relationship between an assembly language instruction 
and a machine instruction. The assembler resolves any symbolic references and 
translates the assembly language instruction into the binary string that comprises the 
machine instruction. 
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DIRECTIVES Directives, also called pseudo-instructions, are assembly language 
statements that are not directly translated into machine language instructions. In- 
stead, directives are instruction to the assembler to perform specified actions doing 
the assembly process. Examples include the following: 

e Define constants 

e Designate areas of memory for data storage 

e Initialize areas of memory 

e Place tables or other fixed data in memory 

e Allow references to other programs 


Table B.2 lists some of the NASM directives. As an example, consider the fol- 
lowing sequence of statements: 


L2 DB “A” ; byte initialized to ASC code for A (65) 
MOV AL, [L1] ; copy byte at L1 into AL 
MOV EAX, Ll ; store address of byte at L1 in EAX 
MOV [L1], AH ; copy contents of AH into byte at Ll 


Table B.2 Some NASM Assembly-Language Directives 
(a) Letters for RESx and Dx Directives 


Unit Letter 


byte 
word (2 bytes) 


double word (4 bytes) 
quad word (8 bytes) 


ten bytes 


(b) Directives 


Name Description Example 


DB, DW, Initalize locations L6 DD 1A92H 
DD, DQ, ; doubleword at L6 initialized to 1A92H 
DT 


RESB, Reserve BUFFER RESB 64 

RESW, uniitialized ; reserve 64 bytes starting at BUFFER 
RESD, locations 
RESQ, 
REST 


INCBIN Include binary file INCBIN “file.dat” ; include this file 
in output 


EQU Define a symbol to MSGLEN EQU 25 
a given constant value ; the constant MSGLEN equals decimal 25 


TIMES Repeat instruction ZEROBUF TIMES 64 DB 0 
multiple times ; initialize 64-byte buffer to all zeros 
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If a plain label is used, it is interpreted as the address (or offset) of the 
data. If the label is placed inside square brackets, it is interpreted as the data at 
the address. 


MACRO DEFINITIONS A macro definition is similar to a subroutine in several 
ways. A subroutine is a section of a program that is written once, and can be used 
multiple times by calling the subroutine from any point in the program. When a 
program is compiled or assembled, the subroutine is loaded only once. A call to 
the subroutine transfers control to the subroutine and a return instruction in the 
subroutine returns control to the point of the call. Similarly, a macro definition is 
a section of code that the programmer writes once, and then can use many times. 
The main difference is that when the assembler encounters a macro call, it re- 
places the macro call with the macro itself. This process is called macro expansion. 
So, if a macro is defined in an assembly language program and invoked 10 times, 
then 10 instances of the macro will appear in the assembled code. In essence, sub- 
routines are handled by the hardware at run time, whereas macros are handled by 
the assembler at assembly time. Macros provide the same advantage as subrou- 
tines in terms of modular programming, but without the runtime overhead of a 
subroutine call and return. The tradeoff is that the macro approach uses more 
space in the object code. 

In NASM and many other assemblers, a distinction is made between a single- 
line macro and a multi-line macro. In NASM, single-line macros are defined using 
the % DEFINE directive. Here is an example in which multiple single-line macros 
are expanded. First, we define two macros: 


2*X 
1 + B(X) 


% DEFINE B(X) 
% DEFINE A(X) 


At some point in the assembly language program, the following statement appears: 
MOV AX, A(8) 

The assembler expands this statement to: 
MOV AX, 1+2*8 


which assembles to a machine instruction to move the immediate value 17 to 
register AX. 

Multiline macros are defined using the mnemonic &MACRO. Here is an ex- 
ample of a multiline macro definition: 


SMACRO PROLOGUE 1 
PUSH EBP ; push contents of EBP onto stack 


; pointed to by ESP and 


; decrement contents of ESP by 4 
MOV EBP, ESP ; copy contents of ESP to EBP 


ie) 
1p) 
ae) 


SUB ESP, %1 ; subtract first parameter value from ] 
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The number 1 after the macro name in the SMACRO line defines the number of 
parameters the macro expects to receive. The use of %1 inside the macro definition 
refers to the first parameter to the macro call. 

The macro call 


MYFUNC: PROLOGUE 12 
expands to the following lines of code: 


MYFUNC: PUSH EBP 
MOV EBP, ESP 
SUB ESP, 12 


Example: Greatest Common Divisor Program 


As an example of the use of assembly language, we look at a program to compute 
the greatest common divisor of two integers. We define the greatest common divisor 
of the integers a and b as follows: 


gcd(a, b) = max[k, such that k divides a and k divides b] 


where we say that k divides a if there is no remainder. Euclid’s algorithm for the 
greatest common divisor is based on the following theorem. For any nonnegative 
integers a and integer b, 


gcd(a, b) = gced(b, a mod b) 
Here is a C language program that implements Euclid’s algorithm: 


unsigned int gcd (unsigned int a, unsigned int b) 


{ 
if (a == 0 && b == 0) 
bees 
else if (b == 0) 
b=a; 
else if (a != 0) 
while (a != b) 
if (a <b) 
b -= a; 
else 
a -= b; 
return b; 


Figure B.3 shows two assembly language versions of the preceding program. 
The program on the left was done by a C compiler; the program on the right was 
programmed by hand. The latter program uses a number of programmert’s tricks to 
produce a tighter, more efficient implementation. 
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gcd: mov ebx, eax @excls neg eax 
mov eax, edx je L3 
test ebx, ebx itis neg eax 
jne itil, xchg eax, edx 
test edx, edx G23 sub eax, edx 
jne inl jg L2 
mov eax,1 jne Ill 
ret Ib S}8 add eax, edx 
lisclies test eax, eax jne L4 
jne L2 Sates eax 
mov eax, ebx L4: ret 
Tee 
iby 3 test ebx, ebx 
je in5) 
L3; cmp ebx, eax 
je is 
jae L4 
sub eax, ebx 
jmp L3 
L4: sub ebx, eax 
jmp L3 
ns et 
(a) Compiled program (b) Written directly in assembly language 


Figure B.3 Assembly Programs for Greatest Common Divisor 


B.2,> ASSEMBLERS 


The assembler is a software utility that takes an assembly program as input and pro- 
duces object code as output. The object code is a binary file. The assembler views 
this file as a block of memory starting at relative location 0. 

There are two general approaches to assemblers: the two-pass assembler and 
the one-pass assembler. 


Two-Pass Assembler 


We look first at the two-pass assembler, which is more common and somewhat easier 
to understand. The assembler makes two passes through the source code (Figure B.4): 


FIRST PASS In the first pass, the assembler is only concerned with label definitions. The 
first pass is used to construct a symbol table that contains a list of all labels and their as- 
sociated location counter (LC) values. The first byte of the object code will have the LC 
value of 0. The first pass examines each assembly statement. Although the assembler is 
not yet ready to translate instructions, it must examine each instruction sufficiently to 
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Read line Read next line from 
from source —() intermediate file 


file 


Close source 
file and rewind 
intermediate file 


(rss?) Assemble 
instruction 


Store name and 
value in symbol table 


label 
defined? 


Determine Write ob, ject instruction 
size of into object file 
instruction 
LC =LC + size Write source & object 
lines into listing file 


Write source line 
& other info on 
intermediate file 


Figure B.4 Flowchart of Two-Pass Assembler 


determine the length of the corresponding machine instruction and therefore how 
much to increment the LC. This may require not only examining the opcode but also 
looking at the operands and the addressing modes. 

Directives such as DO and REST (see Table B.2) cause the location counter to 
be adjusted according to how much storage is specified. 

When assembler encounters a statement with a label, it places the label into 
the symbol table, along with the current LC value. The assembler continues until it 
has read all of the assembly language statements. 
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Always Update 
condition condition Fero 
code flags rotation 


ADDS r3, r3, #19|/1 1 1 0/0 01/0 010;/1)0011/0011j0000j00010011 


Data processin 
; pros g aril instr 
immediate format format 
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14131211109 8 765 432410 


opcode |S Rn Rd rotate immediate 


Figure B.S Translating an ARM Assembly Instruction into a Binary Machine Instruction 


SECOND PASS The second pass reads the program again from the beginning. Each 
instruction is translated into the appropriate binary machine code. Translation in- 
cludes the following operations: 


1. Translate the mnemonic into a binary opcode. 


2. Use the opcode to determine the format of the instruction and the location 
and length of the various fields in the instruction. 


3. Translate each operand name into the appropriate register or memory code. 
4. Translate each immediate value into a binary string. 
5 


. Translate any references to labels into the appropriate LC value using the 
symbol table. 


6. Set any other bits in the instruction that are needed, including addressing 
mode indicators, condition code bits, and so on. 


A simple example, using the ARM assembly language, is shown in Figure B.5. 
The ARM assembly language instruction ADDS 13, 13, #19 is translated in to the bi- 
nary machine instruction 1110 0010 0101 0011 0011 0000 0001 0011. 


ZEROTH PASS Most assembly language includes the ability to define macros. When 
macros are present there is an additional pass that the assembler must make before 
the first pass. Typically, the assembly language requires that all macro definitions 
must appear at the beginning of the program. 

The assembler begins this “zeroth pass” by reading all macro definitions. 
Once all the macros are recognized, the assembler goes through the source code 
and expands the macros with their associated parameters whenever a macro call 
is encountered. The macro processing pass generates a new version of the source 
code with all of the macro expansions in place and all of the macro definitions 
removed. 


One-Pass Assembler 


It is possible to implement an assembler that makes only a single pass through the 
source code (not counting the macro processing pass). The main difficulty in trying 
to assemble a program in one pass involves forward references to labels. Instruction 
operands may be symbols that have not yet been defined in the source program. 
Therefore, the assembler does not know what relative address to insert in the trans- 
lated instruction. 
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In essence, the process of resolving forward references works as follows. When 
the assembler encounters an instruction operand that is a symbol that is not yet de- 
fined, the assembler does the following: 


1. It leaves the instruction operand field empty (all zeros) in the assembled bi- 
nary instruction. 


2. The symbol used as an operand is entered in the symbol table. The table entry 
is flagged to indicate that the symbol is undefined. 


3. The address of the operand field in the instruction that refers to the undefined 
symbol is added to a list of forward references associated with the symbol 
table entry. 


When the symbol definition is encountered so that a LC value can be associated 
with it, the assembler inserts the LC value in the appropriate entry in the symbol 
table. If there is a forward reference list associated with the symbol, then the assem- 
bler inserts the proper address into any instruction previously generated that is on 
the forward reference list. 


Example: Prime Number Program 


We now look at an example that includes directives. This example looks at a pro- 
gram that finds prime numbers. Recall that prime numbers are evenly divisible by 
only 1 and themselves. There is no formula for doing this. The basic method this 
program uses is to find the factors of all odd numbers below a given limit. If no 
factor can be found for an odd number, it is prime. Figure B.6 shows the basic 
algorithm written in C. Figure B.7 shows the same algorithm written in NASM 
assembly language. 


unsigned guess; /* current guess for prime */ 
unsigned factor; /* possible factor of guess */ 
unsigned limit; /* find primes up to this value */ 


joneskinjene ((MiVsitays! jopeslin(Ssy Wis) ee) 9 7) R 


scanf(*%u", &limit); 


joneakioens (YD Aa) y /* treat first two primes as */ 
jonaakajene (OMS\\in*)) 9 /* special case */ 

guess = 5; /aeeienacinte-l cla Wes seme, 

while (guess < = limit) { ie Myeyolie ieee el ikeYelcione OE (enblereyse wy/ 


EAC OE m=aisy, 
while (factor * factor < guess && guess% factor != 0) 
TACEOnM 4 =—2% 
if (guess % factor != 0) 
jomestronone (MSN al, CnbISeIS})) y 


guess += 2; /* only look at odd numbers */ 


Figure B.6 C Program for Testing Primality 
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Sinclude “asm_io.inc” 
segment .data 
Message db “Find primes up to: ", 0 


segment .bss 
Limit resd 1 
Guess resd 1 


segment .text 

global _asm_main 
_asm_main: 

enter 0,0 

pusha 


mov eax, Message 


call print_string 
call read_int 
mov [Limit], eax 
mov eax, 2 

Galil jovestisie slyahe, 
Galil, joweatraye sail, 

mov eax, 3 

@ellil joyestiahe slahe, 
Celi joresiiaje sql 


mov dword [Guess], 5 
while_limit: 

mov eax, [Guess] 

cmp eax, [Limit] 

jnbe end_while_limit 


mov ebx, 3 
while_factor: 

mov eax, ebx 

mul eax 

jo end_while_factor 

cmp eax, [Guess] 

jnb end_while_factor 

mov eax, [Guess] 

mov edx,0 

div ebx 

cmp edx, 0 

je end_while_factor 


ayelel Glove, Pe ikeyelcene t= Be 

jmp while_factor 
end_while_factor: 

je end_if 

mov eax, [Guess] 

call print int 

@ailil joweatraye saul, 
end_if: 

add dword [Guess], 2 

jmp while_limit 
end_while_limit: 


popa 
mov eax, 0 
leave 

ret 


find primes up to this limit 
the current guess for prime 


setup routine 


scant (° Su", Lami t je 


jonealiayese (82 \ia\"”)) 9 


jonealaneie (si War) 9 


Guess = 5; 
while (Guess <= Limit) 


use jnbe since numbers are unsigned 


ebx is factor = 3; 
edx:eax = eax*eax 
if answer won’t fit in eax alone 


if !(factor*factor < guess) 


edx = edx:eax% ebx 


if !(guess% factor != 0) 


if !(guess% factor != 0) 
print (“s\n”) 


guess += 2 


return back) to Cc 


Figure B.7 Assembly Program for Testing Primality 
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B.3. LOADING AND LINKING 


The first step in the creation of an active process is to load a program into main memory 
and create a process image (Figure B.8). Figure B.9 depicts a scenario typical for most 
systems. The application consists of a number of compiled or assembled modules in 
object-code form. These are linked to resolve any references between modules. At the 
same time, references to library routines are resolved. The library routines themselves 
may be incorporated into the program or referenced as shared code that must be sup- 
plied by the operating system at run time. In this section, we summarize the key features 
of linkers and loaders. First, we discuss the concept of relocation. Then, for clarity in the 
presentation, we describe the loading task when a single program module is involved; no 
linking is required. We can then look at the linking and loading functions as a whole. 


Relocation 


In a multiprogramming system, the available main memory is generally shared 
among a number of processes. Typically, it is not possible for the programmer to 
know in advance which other programs will be resident in main memory at the time 
of execution of his or her program. In addition, we would like to be able to swap ac- 
tive processes in and out of main memory to maximize processor utilization by pro- 
viding a large pool of ready processes to execute. Once a program has been swapped 
out to disk, it would be quite limiting to declare that when it is next swapped back 
in, it must be placed in the same main memory region as before. Instead, we may 
need to relocate the process to a different area of memory. 


Process control block 
Program Program 
Data Data 
Object code 
Stack 


Process image in 
main memory 


Figure B.8 The Loading Function 
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Figure B.9 A Linking and Loading Scenario 


Main memory 


Thus, we cannot know ahead of time where a program will be placed, and we 
must allow that the program may be moved about in main memory due to swapping. 
These facts raise some technical concerns related to addressing, as illustrated in 
Figure B.10. The figure depicts a process image. For simplicity, let us assume that the 
process image occupies a contiguous region of main memory. Clearly, the operating 


Process control ——————————————> 


information Rutty point er Process control block 
to program 
Branch 
Program instruction 
Increasing 
address 
values 
Reference 
to data 
Data 
Current top —————>} 
of stack 
Stack 


Figure B.10 Addressing Requirements for a Process 
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system will need to know the location of process control information and of the ex- 
ecution stack, as well as the entry point to begin execution of the program for this 
process. Because the operating system is managing memory and is responsible for 
bringing this process into main memory, these addresses are easy to come by. In ad- 
dition, however, the processor must deal with memory references within the pro- 
gram. Branch instructions contain an address to reference the instruction to be 
executed next. Data reference instructions contain the address of the byte or word 
of data referenced. Somehow, the processor hardware and operating system soft- 
ware must be able to translate the memory references found in the code of the pro- 
gram into actual physical memory addresses, reflecting the current location of the 
program in main memory. 


Loading 


In Figure B.9, the loader places the load module in main memory starting at location 
x. In loading the program, the addressing requirement illustrated in Figure B.10 
must be satisfied. In general, three approaches can be taken: 


e Absolute loading 
e Relocatable loading 
e Dynamic run-time loading 


ABSOLUTE LOADING An absolute loader requires that a given load module always 
be loaded into the same location in main memory. Thus, in the load module presented 
to the loader, all address references must be to specific, or absolute, main memory 
addresses. For example, if x in Figure B.9 is location 1024, then the first word in a 
load module destined for that region of memory has address 1024. 

The assignment of specific address values to memory references within a 
program can be done either by the programmer or at compile or assembly time 
(Table B.3a). There are several disadvantages to the former approach. First, every 
programmer would have to know the intended assignment strategy for placing 
modules into main memory. Second, if any modifications are made to the program 
that involve insertions or deletions in the body of the module, then all of the 
addresses will have to be altered. Accordingly, it is preferable to allow memory 
references within programs to be expressed symbolically and then resolve those 
symbolic references at the time of compilation or assembly. This is illustrated in 
Figure B.11. Every reference to an instruction or item of data is initially repre- 
sented by a symbol. In preparing the module for input to an absolute loader, the 
assembler or compiler will convert all of these references to specific addresses 
(in this example, for a module to be loaded starting at location 1024), as shown in 
Figure B.11b. 


RELOCATABLE LOADING The disadvantage of binding memory references to spe- 
cific addresses prior to loading is that the resulting load module can only be placed 
in one region of main memory. However, when many programs share main memory, 
it may not be desirable to decide ahead of time into which region of memory a par- 
ticular module should be loaded. It is better to make that decision at load time. Thus 
we need a load module that can be located anywhere in main memory. 
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Table B.3 Address Binding 


(a) Loader 


Binding Time Function 


Programming time All actual physical addresses are directly specified by the programmer in the 
program itself. 

Compile or assembly time The program contains symbolic address references, and these are converted 
to actual physical addresses by the compiler or assembler. 


Load time The compiler or assembler produces relative addresses. The loader translates 
these to absolute addresses at the time of program loading. 


Run time The loaded program retains relative addresses. These are converted 
dynamically to absolute addresses by processor hardware. 


(b) Linker 


Linkage Time Function 


Programming time No external program or data references are allowed. The programmer must 
place into the program the source code for all subprograms that are 
referenced. 


Compile or assembly time The assembler must fetch the source code of every subroutine that is 
referenced and assemble them as a unit. 


Load module creation All object modules have been assembled using relative addresses. These 
modules are linked together and all references are restated relative to the 
origin of the final load module. 


Load time External references are not resolved until the load module is to be loaded 
into main memory. At that time, referenced dynamic link modules are 
appended to the load module, and the entire package is loaded into main 
or virtual memory. 


Run time External references are not resolved until the external call is executed by 
the processor. At that time, the process is interrupted and the desired 
module is linked to the calling program. 


To satisfy this new requirement, the assembler or compiler produces not actual 
main memory addresses (absolute addresses) but addresses that are relative to 
some known point, such as the start of the program. This technique is illustrated in 
Figure B.11c. The start of the load module is assigned the relative address 0, and all 
other memory references within the module are expressed relative to the beginning 
of the module. 

With all memory references expressed in relative format, it becomes a simple 
task for the loader to place the module in the desired location. If the module is to be 
loaded beginning at location x, then the loader must simply add x to each memory 
reference as it loads the module into memory. To assist in this task, the load module 
must include information that tells the loader where the address references are and 
how they are to be interpreted (usually relative to the program origin, but also pos- 
sibly relative to some other point in the program, such as the current location). This 
set of information is prepared by the compiler or assembler and is usually referred 
to as the relocation dictionary. 
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Symbolic Absolute Relative Main memory 
addresses addresses addresses addresses 
PROGRAM 1024| PROGRAM 0) PROGRAM x | PROGRAM 
JUMP X mi JUMP a | JUMP 400 mi JUMP 400 mi 
x 1424 400 400 +x 
LOAD Y LOAD 2224 LOAD 1200 LOAD 1200 
| DATA | DATA | os. i | DATA | 
Y 2224 1200 1200 +x 
(a) Object module (b) Absolute load module (c) Relative load module (d) Relative load module 


loaded into main memory 
starting at location x 


Figure B.11 Absolute and Relocatable Load Modules 


DYNAMIC RUN-TIME LOADING Relocatable loaders are common and provide obvi- 
ous benefits relative to absolute loaders. However, in a multiprogramming environ- 
ment, even one that does not depend on virtual memory, the relocatable loading 
scheme is inadequate. We have referred to the need to swap process images in and 
out of main memory to maximize the utilization of the processor. To maximize main 
memory utilization, we would like to be able to swap the process image back into dif- 
ferent locations at different times. Thus, a program, once loaded, may be swapped out 
to disk and then swapped back in at a different location. This would be impossible if 
memory references had been bound to absolute addresses at the initial load time. 

The alternative is to defer the calculation of an absolute address until it is ac- 
tually needed at run time. For this purpose, the load module is loaded into main 
memory with all memory references in relative form (Figure B.11c). It is not until an 
instruction is actually executed that the absolute address is calculated. To assure that 
this function does not degrade performance, it must be done by special processor 
hardware rather than software. This hardware is described in Chapter 8. 

Dynamic address calculation provides complete flexibility. A program can be 
loaded into any region of main memory. Subsequently, the execution of the program 
can be interrupted and the program can be swapped out of main memory, to be later 
swapped back in at a different location. 


Linking 


The function of a linker is to take as input a collection of object modules and pro- 
duce a load module, consisting of an integrated set of program and data modules, to 
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Return J (b) Load module 


(a) Object modules 


Figure B.12 The Linking Function 


be passed to the loader. In each object module, there may be address references to 
locations in other modules. Each such reference can only be expressed symbolically 
in an unlinked object module. The linker creates a single load module that is the 
contiguous joining of all of the object modules. Each intramodule reference must be 
changed from a symbolic address to a reference to a location within the overall load 
module. For example, module A in Figure B.12a contains a procedure invocation of 
module B. When these modules are combined in the load module, this symbolic ref- 
erence to module B is changed to a specific reference to the location of the entry 
point of B within the load module. 


LINKAGE EDITOR The nature of this address linkage will depend on the type of 
load module to be created and when the linkage occurs (Table B.3b). If, as is usually 
the case, a relocatable load module is desired, then linkage is usually done in the fol- 
lowing fashion. Each compiled or assembled object module is created with refer- 
ences relative to the beginning of the object module. All of these modules are put 
together into a single relocatable load module with all references relative to the ori- 
gin of the load module. This module can be used as input for relocatable loading or 
dynamic run-time loading. 

A linker that produces a relocatable load module is often referred to as a link- 
age editor. Figure B.12 illustrates the linkage editor function. 
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DYNAMIC LINKER As with loading, it is possible to defer some linkage functions. 
The term dynamic linking is used to refer to the practice of deferring the linkage of 
some external modules until after the load module has been created. Thus, the load 
module contains unresolved references to other programs. These references can be 
resolved either at load time or run time. 

For load-time dynamic linking (involving upper dynamic library in Figure 
B.9), the following steps occur. The load module (application module) to be loaded 
is read into memory. Any reference to an external module (target module) causes 
the loader to find the target module, load it, and alter the reference to a relative 
address in memory from the beginning of the application module. There are several 
advantages to this approach over what might be called static linking: 


e It becomes easier to incorporate changed or upgraded versions of the target 
module, which may be an operating system utility or some other general- 
purpose routine. With static linking, a change to such a supporting module 
would require the relinking of the entire application module. Not only is this 
inefficient, but it may be impossible in some circumstances. For example, in the 
personal computer field, most commercial software is released in load module 
form; source and object versions are not released. 


e Having target code in a dynamic link file paves the way for automatic code 
sharing. The operating system can recognize that more than one application is 
using the same target code because it loaded and linked that code. It can use 
that information to load a single copy of the target code and link it to both ap- 
plications, rather than having to load one copy for each application. 


e It becomes easier for independent software developers to extend the function- 
ality of a widely used operating system such as Linux. A developer can come 
up with a new function that may be useful to a variety of applications and 
package it as a dynamic link module. 


With run-time dynamic linking (involving lower dynamic library in Figure 
B.9), some of the linking is postponed until execution time. External references to 
target modules remain in the loaded program. When a call is made to the absent 
module, the operating system locates the module, loads it, and links it to the calling 
module. Such modules are typically shareable. In the Windows environment, these 
are call dynamic-link libraries (DLLs) Thus, if one process is already making use of 
a dynamically-linked shared module, then that module is in main memory and a new 
process can simply link to the already-loaded module. 

The use of DLLs can lead to a problem commonly referred to as DLL hell. 
DLL occurs if two or more processes are sharing a DLL module but expect different 
versions of the module. For example, an application or system function might be re- 
installed and bring in with it an older version of a DLL file. 

We have seen that dynamic loading allows an entire load module to be 
moved around; however, the structure of the module is static, being unchanged 
throughout the execution of the process and from one execution to the next. How- 
ever, in some cases, it is not possible to determine prior to execution which object 
modules will be required. This situation is typified by transaction-processing ap- 
plications, such as an airline reservation system or a banking application. The na- 
ture of the transaction dictates which program modules are required, and they are 
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loaded as appropriate and linked with the main program. The advantage of the use 
of such a dynamic linker is that it is not necessary to allocate memory for program 
units unless those units are referenced. This capability is used in support of seg- 
mentation systems. 

One additional refinement is possible: An application need not know the 
names of all the modules or entry points that may be called. For example, a charting 
program may be written to work with a variety of plotters, each of which is driven by 
a different driver package. The application can learn the name of the plotter that is 
currently installed on the system from another process or by looking it up in a con- 
figuration file. This allows the user of the application to install a new plotter that did 
not exist at the time the application was written. 


B.4. RECOMMENDED READING AND WEB SITES 


[SALO93] covers the design and implementation of assemblers and loaders. 

The topics of linking and loading are covered in many books on program development, 
computer architecture, and operating systems. A particularly detailed treatment is [BECK97]. 
[CLAR98] also contains a good discussion. A thorough practical discussion of this topic, with 
numerous OS examples, is [LEVI00]. 

[BARTO3] is an excellent treatment for learning assembly language for x86 processors; 
suitable for self-study. [CART06] covers assembly language for x86 machines. For the serious 
x86 programmer, [FOG08a] is highly useful. [KNAG04] is a thorough treatment of ARM 
assembly language. 


BARTO03 Bartlett, J. Programming from the Ground Up. 2003. Available at this book’s 
Web site. 


BECK97_ Beck, L. System Software. Reading, MA: Addison-Wesley, 1997. 
CART06_ Carter, P PC Assembly Language. July 23,2006. Available at this book’s Web site. 


CLAR98_ Clarke, D., and Merusi, D. System Software Programming: The Way Things 
Work. Upper Saddle River, NJ: Prentice Hall, 1998. 


FOG08a_ Fog, A. Optimizing Subroutines in Assembly Language: An Optimization Guide 


for x86 Platforms. Copenhagen University College of Engineering, 2008. http:// 
www.agner.org/optimize/ 

KNAG04_ = Knaggs, P.,and Welsh, S. ARM: Assembly Language Programming. Bournemouth 
University School of Design, Engineering & Computing. August 31, 2004. Available 
at this book’s Web site. 


LEVIO00 Levine, J. Linkers and Loaders. San Francisco: Morgan Kaufmann, 2000. 


SALO93 Salomon, D. Assemblers and Loaders. Ellis Horwood Ltd, 1993. Available at 
this book’s Web site. 


Recommended Web sites: 


e Gavin’s Guide to 80x86 Assembly: A good, concise overview of x86 assembler language. 
e The Art of Assembly Language Programming: A 1500-page online mega-book on the 
subject. Should be enough for any student of the subject. 
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B.5 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 


Key Terms 


assembler 


label mnemonic 


assembly language linkage editor one-pass assembler 


comment 


directive 
dynamic linker 
instruction 


linking operand 

load-time dynamic linking relocation 

loading run-time dynamic linking 
macro two-pass assembler 


Review Questions 


B.1 
B.2 
B.3 
B.4 
B.5 
B.6 
B.7 


Probl 
B.1 


List some reasons why it is worthwhile to study assembly language programming. 
What is an assembly language? 

List some disadvantages of assembly language compared to high-level languages. 
List some advantages of assembly language compared to high-level languages. 
What are the typical elements of an assembly language statement. 

List and briefly define four different kinds of assembly language statements. 
What is the difference between a one-pass assembler and a two-pass assembler? 


ems 


Core War is a programming game introduced to the public in the early 1980s 
[DEWD84], which was popular for a period of 15 years or so. Core War has four main 
components: a memory array of 8000 addresses, a simplified assembly language Red- 
code, an executive program called MARS (an acronym for Memory Array Redcode 
Simulator) and the set of contending battle programs. Two battle programs are entered 
into the memory array at randomly chosen positions; neither program knows where 
the other one is. MARS executes the programs in a simple version of time-sharing. The 
two programs take turns: a single instruction of the first program is executed, then a 
single instruction of the second, and so on. What a battle program does during the ex- 
ecution cycles allotted to it is entirely up to the programmer. The aim is to destroy the 
other program by ruining its instructions. In this problem and the next several, we use 
an even simpler language, called CodeBlue, to explore some Core War concepts. 
CodeBlue contains only five assembly language statements and uses three ad- 
dressing modes (Table B.4). Addresses wrap around, so that for the last location in 
memory, the relative address of +1 refers to the first location in memory. For exam- 
ple, ADD #4, 6 adds 4 to the contents of relative location 6 and stores the results in 
location 6; JUMP @5 transfers execution to the memory address contained in the lo- 
cation five slots past the location of the current JUMP instruction. 
a. The program Imp is the single instruction COPY 0, 1.What does it do? 
b. The program Dwarf is the following sequence of instructions: 


ADD #4, 3 
COPY 2, @2 
JUMP -2 
DATA 0 


What does it do? 
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Table B.4 CodeBlue Assembly Language 


(a) Instruction Set 


Format Meaning 


DATA  <value> <value> set at current location 
COPY A,B copies source A to destination B 
ADD A,B adds A to B, putting result in B 
JUMP A transfer execution to A 

JUMPZ A,B if B = 0, transfer to A 


(b) Addressing Modes 


Mode Format Meaning 


Literal # followed by value This is an immediate mode, the operand value is in the 


instruction. 


Relative Value The value represents an offset from the current location, 


which contains the operand. 


Indirect @ followed by value The value represents an offset from the current location; 
the offset location contains the relative address of the 
location that contains the operand. 


B.6 


c. Rewrite Dwarf using symbols, so that it looks more like a typical assembly langauge 
program. 


What happens if we pit Imp against Dwarf? 


Write a “carpet bombing” program in CodeBlue that zeros out all of memory (with 
the possible exception of the program locations). 


How would the following program fare against Imp? 


Loop COPY #0, -1 
JUMP -1 


Hint: Remember that instruction execution alternates between the two opposing 
programs. 


a. What is the value of the C status flag after the following sequence: 


mov al, 3 
add al, 4 


b. What is the value of the C status flag after the following sequence: 
mov al, 3 
sub al, 4 

Consider the following NAMS instruction: 


emp vleft, vright 
For signed integers, there are three status flags that are relevant. If vleft = vright, then 
ZF is set. If vleft > vright, ZF is unset (set to 0) and SF = OF If vleft < vright, ZF is 
unset and SF # OF. Why does SF = OF if vleft > vright? 
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B.7 Consider the following NASM code fragment: 


B.8 


B.9 


B.10 


B.11 


B.14 


mov al, 0 

cmp al, al 

je next 
Write an equivalent program consisting of a single instruction. 
Consider the following C program: 
/* a simple C program to average 3 integers */ 
main () 
{ int avg; 

int il = 20; 

int 22 5 133 

int i3 = 82; 

avg = (i1 + i2 + i3)/3; 
} 
Write an NASM version of this program. 
Consider the following C code fragment: 

if (EAX == 0) EBX = 1; 

else EBX = 2; 
Write an equivalent NASM code fragment. 


The initialize data directives can be used to initialize multiple locations. For example, 
db 0x55,0x56,0x57 
reserves three bytes and initializes their values. 


NASM supports the special token $ to allow calculations to involve the current as- 
sembly position. That is, $ evalutes to the assembly position at the beginning of the 
line containing the expression. With the preceding two facts in mind, consider the fol- 
lowing sequence of directives: 


message db ‘hello, world’ 
msglen equ $-message 
What value is assigned to the symbol msglen? 


Assume the three symbolic variables V1, V2, V3 contain integer values. Write an 

NASM code fragment that moves the smallest value into integer ax. Use only the 

instructions mov, cmp, and jbe. 

Describe the effect of this instruction: cmp eax, 1 

Assume that the immediately preceding instruction updated the contents of eax. 

The xchg instruction can be used to exchange the contents of two registers. Suppose 

that the x86 instruction set did not support this instruction. 

a. Implement xchg ax, bx using only push and pop instructions. 

b. Implement xchg ax, bx using only the xor instruction (do not involve other 
registers). 

In the following program, assume that a, b, x, y are symbols for main memory 

locations. What does the program do? You can answer the question by writing the 

equivalent logic in C. 


mov eax,a 
mov ebx,b 
xor eax, xX 


B.15 


B.16 


B.17 
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xor ebx,y 


or eax, ebx 
jnz L2 
Li: ; sequence of instructions... 
jmp L3 
L2: ; another sequence of instructions... 
L3: 
Section B.1 includes a C program that calculates the greatest common divisor of two 
integers. 
a. Describe the algorithm in words and show how the program does implement the 


b. 


Cc. 


Euclid algorithm approach to calculating the greatest common divisor. 

Add comments to the assembly program of Figure B.3a to clarify that it imple- 
ments the same logic as the C program. 

Repeat part (b) for the program of Figure B.3b. 


a. A 2-pass assembler can handle future symbols and an instruction can therefore 


b. 


use a future symbol as an operand. This is not always true for directives. The EQU 
directive, for example, cannot use a future symbol. The directive ‘A EQU B+1’ is 
easy to execute if B is previously defined, but impossible if B is a future symbol. 
What’s the reason for this? 

Suggest a way for the assembler to eliminate this limitation such that any source 
line could use future symbols. 


Consider a symbol directive MAX of the following form: 


symbol MAX list of expressions 


The label is mandatory and is assigned the value of the largest expression in the 
operand field. Example: 


MSGLEN MAX A, B, C ; where A, B, C are defined 
symbols 


How is MAX executed by the Assembler and in what pass? 
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Fractions 
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19.5 Key Terms and Problems 
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19.1 THE DECIMAL SYSTEM 


In everyday life we use a system based on decimal digits (0, 1, 2, 3, 4,5, 6,7, 8, 9) to 
represent numbers and refer to the system as the decimal system. Consider what the 
number 83 means. It means eight tens plus three: 


83 = (8 X 10) + 3 
The number 4728 means four thousands, seven hundreds, two tens, plus eight: 
4728 = (4 x 1000) + (7 x 100) + (2 xX 10) + 8 


The decimal system is said to have a base, or radix, of 10. This means that each 
digit in the number is multiplied by 10 raised to a power corresponding to that 
digit’s position: 
83 = (8 x 10') + (3 x 10°) 
4728 = (4 x 10°) + (7 X 10°) + (2 x 10!) + (8 x 10°) 
The same principle holds for decimal fractions but negative powers of 10 are 


used. Thus, the decimal fraction 0.256 stands for 2 tenths plus 5 hundredths plus 
6 thousandths: 


0.256 = (2 xX 10°') + (5 x 10°) + (6 x 10°) 


A number with both an integer and fractional part has digits raised to both 
positive and negative powers of 10: 


472.256 = (4 x 10’) + (7 x 10") + (2 x 10°) + (2 x 10°) + (5 x 10°) + (6 x 10°) 


In general, for the decimal representation of X = { 2. dsdjdyd_4d_sd5...}, the 
value of X is 


X= >, 10 (19.1) 


19.2 THE BINARY SYSTEM 


In the decimal system, 10 different digits are used to represent numbers with a base 
of 10. In the binary system, we have only two digits, 1 and 0. Thus, numbers in the bi- 
nary system are represented to the base 2. 

To avoid confusion, we will sometimes put a subscript on a number to indicate 
its base. For example, 83,9 and 4728,) are numbers represented in decimal notation 
or, more briefly, decimal numbers. The digits 1 and 0 in binary notation have the 
same meaning as in decimal notation: 


02 = O10 
1, = lo 
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To represent larger numbers, as with decimal notation, each digit in a binary number 
has a value depending on its position: 
10, = (1 x 2') + (0 X 2°) = 219 
15 = x2) 2°). = 39 
100, = (1 x 27) + (0 x 2') + (0 X 2°) = 44 
and so on. Again, fractional values are represented with negative powers of the radix: 
1001.101 = 2? + 2°+ 21+ 2% = 9.6254) 
In general, for the binary representation of Y = {. . bybybo.b_yb_xb_3... } the value 


of Y is 


¥ = Sb x 2/) (19.2) 


19.3 CONVERTING BETWEEN BINARY AND DECIMAL 


It is a simple matter to convert a number from binary notation to decimal notation. 
In fact, we showed several examples in the previous subsection. All that is required 
is to multiply each binary digit by the appropriate power of 2 and add the results. 

To convert from decimal to binary, the integer and fractional parts are handled 
separately. 
Integers 
For the integer part, recall that in binary notation, an integer represented by 

by—1Dm—2 one bob, bo b; =0Oorl 
has the value 
(Bm—1 X 21) + (by-» X 27) +--+ + (B, X 2) + Dy 


Suppose it is required to convert a decimal integer N into binary form. If we 
divide N by 2, in the decimal system, and obtain a quotient N; and a remainder Ro, 
we may write 


N=2XWN,+ Ro Ry = O0orl 


Next, we divide the quotient N, by 2. Assume that the new quotient is N, and the 
new remainder R,. Then 


N,=2XN,+R R, = Oorl 
so that 
N = 2(2N, + Ry) + Ro = (Np X 27) + (R, X 2') + Ry 
If next 


Ny = 2N3 + R, 
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we have 
N = (Ny X 23) + (Rp X 27) + (R, X 2!) + Ro 
Because N > N, > N3..., continuing this sequence will eventually produce a quo- 
tient N,,-1 = 1 (except for the decimal integers 0 and 1, whose binary equivalents 
are 0 and 1, respectively) and a remainder R,,,-», which is 0 or 1. Then 
N =(1 X 21) + (Ry_2 X 27) +--+ + (Rp X 27) + (R, X 2) + Ry 

which is the binary form of N. Hence, we convert from base 10 to base 2 by repeated 


divisions by 2. The remainders and the final quotient, 1, give us, in order of increas- 
ing significance, the binary digits of N. Figure 19.1 shows two examples. 


Quotient Remainder 
ial 
pas 5) 4 
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5 2 
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2 eG 9 
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TOUOT, = iy, 


(b) 21io 


Figure 19.1 Examples of Converting from 
Decimal Notation to Binary Notation for 
Integers 
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Fractions 


For the fractional part, recall that in binary notation, a number with a value between 
0 and 1 is represented by 


0.b_,b_,b_3... b; = Oorl 
and has the value 
(bX 27) + (bX 27%) + (bX 2%)--- 
This can be rewritten as 
21x (b, +21X (b,4+2'X* (b,+--: 


This expression suggests a technique for conversion. Suppose we want to con- 
vert the number F (0 < F < 1) from decimal to binary notation. We know that F 
can be expressed in the form 


F=21'xX (b, +21 x (b,+2'X (b44+--- 
If we multiply F by 2, we obtain, 
2X F=b,+2'X (b,+2'X(b,+--: 


From this equation, we see that the integer part of (2 < F), which must be ei- 
ther 0 or 1 because 0 < F < 1, is simply b_;. So we can say (2 X F) = b, + Fi, 
where 0 < F, < 1 and where 


F,=21x (b,+2'X (b,+2'X (byt: 


To find b_,, we repeat the process. Therefore, the conversion algorithm involves re- 
peated multiplication by 2. At each step, the fractional part of the number from the 
previous step is multiplied by 2. The digit to the left of the decimal point in the prod- 
uct will be 0 or 1 and contributes to the binary representation, starting with the most 
significant digit. The fractional part of the product is used as the multiplicand in the 
next step. Figure 19.2 shows two examples. 

This process is not necessarily exact; that is, a decimal fraction with a finite 
number of digits may require a binary fraction with an infinite number of digits. In 
such cases, the conversion algorithm is usually halted after a prespecified number of 
steps, depending on the desired accuracy. 


19.4 HEXADECIMAL NOTATION 


Because of the inherent binary nature of digital computer components, all forms of 
data within computers are represented by various binary codes. However, no matter 
how convenient the binary system is for computers, it is exceedingly cumbersome 
for human beings. Consequently, most computer professionals who must spend time 
working with the actual raw data in the computer prefer a more compact notation. 

What notation to use? One possibility is the decimal notation. This is certainly 
more compact than binary notation, but it is awkward because of the tediousness of 
converting between base 2 and base 10. 
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Product Integer Part 0.110011, 


0.81xX2 = 1.62 1 —__4 
| 

0.62X2 = 1.24 1 
pe ae 

0.24 x2 = 0.48 0 
Po ae 

0.48% 2 = 0.96 0 
La 

0.96 x2 = 1.92 1 
ae 

0.92 X2 = 1.84 1 


(a) 0.81,) = 0.110011, (approximately) 


Product Integer Part 0.01, 
Qo25 6 2 = Oos 0 


0.5 xX 2=1.0 1 


(b) 0.25,, = 0.01, (exactly) 


Figure 19.2 Examples of Converting from Decimal 
Notation to Binary Notation for Fractions 


Instead, a notation known as hexadecimal has been adopted. Binary digits are 
grouped into sets of four. Each possible combination of four binary digits is given a 
symbol, as follows: 


0000 = 0 0100 = 4 1000 = 8 1100 =C 
0001 = 1 0101 = 5 1001 = 9 1101 = D 
0010 = 2 0110 = 6 1010=A 1110 =E 
0011 = 3 0111 =7 1011 = B 1111 =F 


Because 16 symbols are used, the notation is called hexadecimal, and the 16 symbols 
are the hexadecimal digits. 

A sequence of hexadecimal digits can be thought of as representing an integer 
in base 16 (Table 19.1). Thus, 


2Ci6 = (216 x 16') + (Ci6 x 16°) 
= (210 x 16') + (1249 x 16°) = 44 


Hexadecimal notation is used not only for representing integers. It is also used 
as a concise notation for representing any sequence of binary digits, whether they 
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Table 19.1 Decimal, Binary, and Hexadecimal 


Decimal Binary Hexadecimal 
(base 10) (base 2) (base 16) 


0000 
0001 
0010 
0011 
0100 
0101 
0110 
0111 
1000 
1001 
1010 
1011 
1100 
1101 
1110 
1111 
0001 0000 
0001 0001 
0001 0010 
0001 0000 
0110 0100 
1111 0000 
0001 0000 0000 
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represent text, numbers, or some other type of data. The reasons for using hexadeci- 
mal notation are 
1. It is more compact than binary notation. 


2. In most computers, binary data occupy some multiple of 4 bits, and hence some 
multiple of a single hexadecimal digit. 


3. It is extremely easy to convert between binary and hexadecimal. 
As an example of the last point, consider the binary string 110111100001. This 
is equivalent to 
1101 =—1110 0001 = DEli¢ 
D E 1 
This process is performed so naturally that an experienced programmer can 


mentally convert visual representations of binary data to their hexadecimal equiva- 
lent without written effort 
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19.5 KEY TERMS AND PROBLEMS 


Key Terms 


base 
binary 


decimal 


fraction integer 
hexadecimal radix 


Problems 
19.1 Convert the following binary numbers to their decimal equivalents: 
a. 001100 b. 000011 c. 011100 d. 111100 e. 101010 
19.2 Convert the following binary numbers to their decimal equivalents: 
a. 11100.011 _b. 110011.10011 ce. 1010101010.1 
19.3 Convert the following decimal numbers to their binary equivalents: 
a. 64 b. 100 ce. 111 d. 145 e. 255 
19.4 Convert the following decimal numbers to their binary equivalents: 
a. 34.75 b. 25.25 ¢. 27.1875 
19.5 Express the following octal numbers in hexadecimal notation: 
a. 12 b. 5655 ce. 2550276 d. 76545336 e. 3726755, 
19.6 Convert the following hexadecimal numbers to their decimal equivalents: 
a. C b. 9F c. D52 d. 67E e. ABCD 
19.7 Convert the following hexadecimal numbers to their decimal equivalents: 
a. F4 b. D3.E ec. 1111.1 d. 888.8 e. EBA.C 
19.8 Convert the following decimal numbers to their hexadecimal equivalents: 
a. 16 b. 80 c. 2560 d. 3000 e. 62,500 
19.9 Convert the following decimal numbers to their hexadecimal equivalents: 
a. 204.125 b. 255.875 e. 631.25 d. 10000.00390625 

19.10 Convert the following hexadecimal numbers to their binary equivalents: 
aE b.1C c. A64 d. 1E.C e. 239.4 

19.11 Convert the following binary numbers to their hexadecimal equivalents: 

a. 1001.1111 b.110101.011001 —c. 10100111.111011 

19.12 Prove that every real number with a terminating binary representation (finite num- 
ber of digits to the right of the binary point) also has a terminating decimal represen- 
tation (finite number of digits to the right of the decimal point). 

19.13 Equations (19.1) and (19.2) define the representation of numbers in base 10 and base 2, re- 
spectively. In general, for the representation in base g of X = { ogg Kg XX MGW aes ps 
the value of X is 

Aue > (x; X g') 
Thus, 65 in base 7 is (6 X 7!) + (5 X 7°) = 47. 
Count from one to 20j9 in the following bases: 
a.8 b.6 ec. 5 d. 3 
19.14 Perform the indicated base conversions: 
a. 54, to base 5 
b. 312, to base 7 
c. 520. to base 7 
d. 12212; to base 9 
19.15 What generalizations can you draw about converting a number from one base to a power 


of that base, e.g., from base 3 to base 9 (3”) or from base 2 to base 4 (27) or base 8 (23)? 
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The operation of the digital computer is based on the storage and processing of binary 
data. Throughout this book, we have assumed the existence of storage elements that 
can exist in one of two stable states and of circuits than can operate on binary data 
under the control of control signals to implement the various computer functions. In 
this appendix, we suggest how these storage elements and circuits can be implemented 
in digital logic, specifically with combinational and sequential circuits. The appendix be- 
gins with a brief review of Boolean algebra, which is the mathematical foundation of 
digital logic. Next, the concept of a gate is introduced. Finally, combinational and 
sequential circuits, which are constructed from gates, are described. 


20.1 BOOLEAN ALGEBRA 


The digital circuitry in digital computers and other digital systems is designed, and 
its behavior is analyzed, with the use of a mathematical discipline known as Boolean 
algebra. The name is in honor of an English mathematician George Boole, who pro- 
posed the basic principles of this algebra in 1854 in his treatise, An Investigation of 
the Laws of Thought on Which to Found the Mathematical Theories of Logic and 
Probabilities. In 1938, Claude Shannon, a research assistant in the Electrical Engi- 
neering Department at M.ILT., suggested that Boolean algebra could be used to 
solve problems in relay-switching circuit design [SHAN38].' Shannon’s techniques 
were subsequently used in the analysis and design of electronic digital circuits. 
Boolean algebra turns out to be a convenient tool in two areas: 


e Analysis: It is an economical way of describing the function of digital circuitry. 
e Design: Given a desired function, Boolean algebra can be applied to develop a 


simplified implementation of that function. 


As with any algebra, Boolean algebra makes use of variables and operations. 
In this case, the variables and operations are logical variables and operations. Thus, 
a variable may take on the value 1 (TRUE) or 0 (FALSE). The basic logical opera- 
tions are AND, OR, and NOT, which are symbolically represented by dot, plus sign, 
and overbar:” 
A ANDB=A-B 
AORB=A+B 
NOTA =A 
The operation AND yields true (binary value 1) if and only if both of its operands are 
true. The operation OR yields true if either or both of its operands are true. The unary 
operation NOT inverts the value of its operand. For example, consider the equation 
D=A+(B-C) 
D is equal to 1 if A is 1 or if both B = 0 and C = 1. Otherwise D is equal to 0. 


'The paper is available at this book’s Web site. 
Logical NOT is often indicated by an apostrophe: NOT A = A’. 
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Table 20.1 Boolean Operators 


(a) Boolean Operators of Two Input Variables 


P NAND Q 
(P-Q) 


(b) Boolean Operators Extended to More than Two Inputs (A, B,...) 


Operation Expression Output = 1if 


AND A:B-... All of the set {A, B, ...} are 1. 
OR Ad ae Any of the set {A, B,...} are 1. 
Any of the set {A, B, ...} are 0. 
All of the set {A, B, ...} are 0. 


The set {A, B, ...} contains an odd number of ones. 


Several points concerning the notation are needed. In the absence of paren- 
theses, the AND operation takes precedence over the OR operation. Also, when no 
ambiguity will occur, the AND operation is represented by simple concatenation in- 
stead of the dot operator. Thus, 


A+B:C=A+(B:C)=A+ BC 


all mean: Take the AND of B and C; then take the OR of the result and A. 

Table 20.1a defines the basic logical operations in a form known as a truth 
table, which lists the value of an operation for every possible combination of values 
of operands. The table also lists three other useful operators: XOR, NAND, and 
NOR. The exclusive-or (XOR) of two logical operands is 1 if and only if exactly one 
of the operands has the value 1. The NAND function is the complement (NOT) of 
the AND function, and the NOR is the complement of OR: 


ANAND B = NOT (A AND B) = AB 
A NORB = NOT(A ORB) =A+B 


As we shall see, these three new operations can be useful in implementing certain 
digital circuits. 

The logical operations, with the exception of NOT, can be generalized to more 
than two variables, as shown in Table 20.1b. 

Table 20.2 summarizes key identities of Boolean algebra. The equations have 
been arranged in two columns to show the complementary, or dual, nature of the AND 
and OR operations. There are two classes of identities: basic rules (or postulates), which 
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Table 20.2 Basic Identities of Boolean Algebra 


Basic Postulates 


A‘B=B:A A+B=Bt+A Commutative Laws 
A:(B + C) = (A-B) + (A:C) A + (B-C) = (A+ B):-(A+C) Distributive Laws 
1A=A O+A=A Identity Elements 
A-A=0 A+A=1 Inverse Elements 


Other Identities 


0-A=0 1+A=1 
A‘A=A A+A=A 
A:(B-C) = (A-B)-C A+(B+C)=(A+B)+C Associative Laws 


A-B= 


A+B A+B=A-B DeMorgan’s Theorem 


are stated without proof, and other identities that can be derived from the basic postu- 
lates. The postulates define the way in which Boolean expressions are interpreted. One 
of the two distributive laws is worth noting because it differs from what we would find 
in ordinary algebra: 
A + (B:C) = (A + B)-(A + C) 

The two bottommost expressions are referred to as DeMorgan’s theorem. We can 
restate them as follows: 

ANORB = AANDB 

ANAND B = AORB 


The reader is invited to verify the expressions in Table 20.2 by substituting 
actual values (1s and 0s) for the variables A, B, and C. 


20.2 GATES 


The fundamental building block of all digital logic circuits is the gate. Logical func- 
tions are implemented by the interconnection of gates. 

A gate is an electronic circuit that produces an output signal that is a simple 
Boolean operation on its input signals. The basic gates used in digital logic are AND, 
OR, NOT, NAND, NOR, and XOR. Figure 20.1 depicts these six gates. Each gate is 
defined in three ways: graphic symbol, algebraic notation, and truth table. The sym- 
bology used here and throughout the appendix is the IEEE standard, IEEE Std 91. 
Note that the inversion (NOT) operation is indicated by a circle. 

Each gate shown in Figure 20.1 has one or two inputs and one output. Howev- 
er, as indicated in Table 20.1b, all of the gates except NOT can have more than two 
inputs. Thus, (X + Y + Z) can be implemented with a single OR gate with three in- 
puts. When one or more of the values at the input are changed, the correct output 
signal appears almost instantaneously, delayed only by the propagation time of 
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Name Graphical Symbol ee Truth Table 
ABLE 

A F=AeB 0 

AND 5] or o1lo 
F=AB 1O7)-0 

11/1 

ABIF 

A 00lo 

OR —| >— F=A+B o1}i 
B 1o0}l 

11/4 

F=A AF 

NOT a—)>o—r a of 1 

F=A' ae 

ABIF 

A _ 00/1 

NAND | po F=AB o1li 
B 1o0}1 

1ilo 

ABP 

& 0 0/1 

NOR | Det F=A+TB 0 1/0 
B 1 0]0 

1 1/0 

ABE 

A 0 0f0 

XOR >: F=A@B 011 
7 10|1 

1 1/0 


Figure 20.1 Basic Logic Gates 


signals through the gate (known as the gate delay). The significance of this delay is 
discussed in Section 20.3. In some cases, a gate is implemented with two outputs, one 
output being the negation of the other output. 
Here we introduce a common term: we say that to assert a signal is to cause 
signal line to make a transition from its logically false (0) state to its logically true 
(1) state. The true (1) state is either a high or low voltage state, depending on the 
type of electronic circuitry. 
Typically, not all gate types are used in implementation. Design and fabrication 
are simpler if only one or two types of gates are used. Thus, it is important to identify 
functionally complete sets of gates. This means that any Boolean function can be imple- 
mented using only the gates in the set. The following are functionally complete sets: 


e AND, OR, NOT 


° OR,NOT 
° NAND 
° NOR 


AND, NOT 
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A+B 


Figure 20.2 The Use of NAND Gates 


It should be clear that AND, OR, and NOT gates constitute a functionally 
complete set, because they represent the three operations of Boolean algebra. For 
the AND and NOT gates to form a functionally complete set, there must be a way to 
synthesize the OR operation from the AND and NOT operations. This can be done 
by applying DeMorgan’s theorem: 

A+B=A°:B 
A ORB = NOT ((NOT A) AND (NOT B)) 
Similarly, the OR and NOT operations are functionally complete because they can 
be used to synthesize the AND operation. 


Figure 20.2 shows how the AND, OR, and NOT functions can be implemented 
solely with NAND gates, and Figure 20.3 shows the same thing for NOR gates. For 


Figure 20.3 The Use of NOR Gates 
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this reason, digital circuits can be, and frequently are, implemented solely with 
NAND gates or solely with NOR gates. 

With gates, we have reached the most primitive circuit level of computer hard- 
ware. An examination of the transistor combinations used to construct gates departs 
from that realm and enters the realm of electrical engineering. For our purposes, 
however, we are content to describe how gates can be used as building blocks to 
implement the essential logical circuits of a digital computer. 


20.3 COMBINATIONAL CIRCUITS 


A combinational circuit is an interconnected set of gates whose output at any time is a 
function only of the input at that time. As with a single gate, the appearance of the input 
is followed almost immediately by the appearance of the output, with only gate delays. 

In general terms, a combinational circuit consists of n binary inputs and m bi- 
nary outputs. As with a gate, a combinational circuit can be defined in three ways: 


e Truth table: For each of the 2” possible combinations of input signals, the 
binary value of each of the m output signals is listed. 
e Graphical symbols: The interconnected layout of gates is depicted. 


e Boolean equations: Each output signal is expressed as a Boolean function of 
its input signals. 


Implementation of Boolean Functions 


Any Boolean function can be implemented in electronic form as a network of gates. 
For any given function, there are a number of alternative realizations. Consider the 
Boolean function represented by the truth table in Table 20.3. We can express this func- 
tion by simply itemizing the combinations of values of A, B, and C that cause F to be 1: 


F = ABC + ABC + ABC (20.1) 


There are three combinations of input values that cause F to be 1, and if any one 
of these combinations occurs, the result is 1. This form of expression, for self-evident 


Table 20.3. A Boolean Function of Three Variables 


> 
w=] 
©) 
= 


0 
0 
0 
0 
1 
1 
1 
1 
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A B Cc 


Figure 20.4 Sum-of-Products Implementation of Table 20.3 


reasons, is known as the sum of products (SOP) form. Figure 20.4 shows a straightfor- 
ward implementation with AND, OR, and NOT gates. 

Another form can also be derived from the truth table. The SOP form expresses 
that the output is 1 if any of the input combinations that produce 1 is true. We can 
also say that the output is 1 if none of the input combinations that produce 0 is true. 
Thus, 


F = (ABC):(ABC):(ABC):(ABC): (ABC) 
This can be rewritten using a generalization of DeMorgan’s theorem: 


(X°¥-Z)=X+Y+Z 


Thus, 
F=(A+B+C):(A+B+C):(A+B+C):(A+B+C):(A+B+C) (20.2) 
(A+B+C):(A+B+C):(A+B+C)-(A+B+C):(A+B+C) 


This is in the product of sums (POS) form, which is illustrated in Figure 20.5. For 
clarity, NOT gates are not shown. Rather, it is assumed that each input signal and its 
complement are available. This simplifies the logic diagram and makes the inputs to 
the gates more readily apparent. 

Thus, a Boolean function can be realized in either SOP or POS form. At this 
point, it would seem that the choice would depend on whether the truth table con- 
tains more 1s or Os for the output function: The SOP has one term for each 1, and the 
POS has one term for each 0. However, there are other considerations: 


e Itis often possible to derive a simpler Boolean expression from the truth table 
than either SOP or POS. 
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Am >I Am >l aly > QS 
be | 


1 >! 


Figure 20.5 Product-of-Sums 
Implementation of Table 20.3 


e It may be preferable to implement the function with a single gate type 
(NAND or NOR). 


The significance of the first point is that, with a simpler Boolean expression, 
fewer gates will be needed to implement the function. Three methods that can be 
used to achieve simplification are 

e Algebraic simplification 
e Karnaugh maps 
e Quine—McKluskey tables 


ALGEBRAIC SIMPLIFICATION Algebraic simplification involves the application of 
the identities of Table 20.2 to reduce the Boolean expression to one with fewer ele- 
ments. For example, consider again Equation (20.1). Some thought should convince 
the reader that an equivalent expression is 


F=AB+BC (20.3) 
Or, even simpler, 
F=B(A + C) 


This expression can be implemented as shown in Figure 20.6. The simplification of 
Equation (20.1) was done essentially by observation. For more complex expressions, 
some more systematic approach is needed. 
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a > 


F 
B 


Figure 20.6 Simplified Implementa- 
tion of Table A.3 


KARNAUGH MAPS For purposes of simplification, the Karnaugh map is a conve- 
nient way of representing a Boolean function of a small number (up to four) of vari- 
ables. The map is an array of 2” squares, representing all possible combinations of 
values of n binary variables. Figure 20.7a shows the map of four squares for a func- 
tion of two variables. It is essential for later purposes to list the combinations in the 
order 00, 01, 11, 10. Because the squares corresponding to the combinations are to 
be used for recording information, the combinations are customarily written above 
the squares. In the case of three variables, the representation is an arrangement of 
eight squares (Figure 20.7b), with the values for one of the variables to the left and 
for the other two variables above the squares. For four variables, 16 squares are 
needed, with the arrangement indicated in Figure 20.7c. 

The map can be used to represent any Boolean function in the following way. 
Each square corresponds to a unique product in the sum-of-products form, with a 1 
value corresponding to the variable and a 0 value corresponding to the NOT of that 


AB BC 


00 O01 11 10 00 O01 11 10 
[ [a 


(b) F = ABC + ABC + ABC 


CD 
oa M"(W Cc 
00 O1 11 10 — 
00 
Ol 
AB B 
11 
A 
10 
eae 
D 


(c) F= ABCD + ABCD + ABCD 
(d) Simplified labeling of map 


Figure 20.7 The Use of Karnaugh Maps to Represent Boolean 
Functions 
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variable. Thus, the product AB corresponds to the fourth square in Figure 20.7a. For 
each such product in the function, 1 is placed in the corresponding square. Thus, for 
the two-variable example, the map corresponds to AB + AB. Given the truth table 
of a Boolean function, it is an easy matter to construct the map: for each combina- 
tion of values of variables that produce a result of 1 in the truth table, fill in the cor- 
responding square of the map with 1. Figure 20.7b shows the result for the truth 
table of Table 20.3. To convert from a Boolean expression to a map, it is first neces- 
sary to put the expression into what is referred to as canonical form: each term in 
the expression must contain each variable. So, for example, if we have Equation 
(20.3), we must first expand it into the full form of Equation (20.1) and then convert 
this to a map. 

The labeling used in Figure 20.7d emphasizes the relationship between vari- 
ables and the rows and columns of the map. Here the two rows embraced by the 
symbol A are those in which the variable A has the value 1; the rows not embraced 
by the symbol A are those in which A is 0; similarly for B, C, and D. 

Once the map of a function is created, we can often write a simple algebraic 
expression for it by noting the arrangement of the 1s on the map. The principle is as 
follows. Any two squares that are adjacent differ in only one of the variables. If two 
adjacent squares both have an entry of one, then the corresponding product terms 
differ in only one variable. In such a case, the two terms can be merged by eliminat- 
ing that variable. For example, in Figure 20.8a, the two adjacent squares correspond 
to the two terms ABCD and ABCD. Thus, the function expressed is 


ABCD + ABCD = ABD 


This process can be extended in several ways. First, the concept of adjacency 
can be extended to include wrapping around the edge of the map. Thus, the top 
square of a column is adjacent to the bottom square, and the leftmost square of a 
row is adjacent to the rightmost square. These conditions are illustrated in Figures 
20.8b and c. Second, we can group not just 2 squares but 2” adjacent squares (that is, 
2, 4, 8, etc.). The next three examples in Figure 20.8 show groupings of 4 squares. 
Note that in this case, two of the variables can be eliminated. The last three exam- 
ples show groupings of 8 squares, which allow three variables to be eliminated. 

We can summarize the rules for simplification as follows: 


1. Among the marked squares (squares with a 1), find those that belong to a 
unique largest block of 1,2, 4, or 8 and circle those blocks. 


nN 


. Select additional blocks of marked squares that are as large as possible and as 
few in number as possible, but include every marked square at least once. The 
results may not be unique in some cases. For example, if a marked square com- 
bines with exactly two other squares, and there is no fourth marked square to 
complete a larger group, then there is a choice to be made as two which of the 
two groupings to choose. When you are circling groups, you are allowed to use 
the same 1 value more than once. 


3. Continue to draw loops around single marked squares, or pairs of adjacent 
marked squares, or groups of four, eight, and so on in such a way that every 
marked square belongs to at least one loop; then use as few of these blocks as 
possible to include all marked squares. 
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Figure 20.8 The Use of Karnaugh Maps 


Figure 20.9a, based on Table 20.3, illustrates the simplification process. If any 
isolated 1s remain after the groupings, then each of these is circled as a group of 1s. 
Finally, before going from the map to a simplified Boolean expression, any group of 
1s that is completely overlapped by other groups can be eliminated. This is shown in 
Figure 20.9b. In this case, the horizontal group is redundant and may be ignored in 
creating the Boolean expression. 

One additional feature of Karnaugh maps needs to be mentioned. In some 
cases, certain combinations of values of variables never occur, and therefore the cor- 
responding output never occurs. These are referred to as “don’t care” conditions. 
For each such condition, the letter “d” is entered into the corresponding square of 
the map. In doing the grouping and simplification, each “d” can be treated as a 1 or 0, 
whichever leads to the simplest expression. 
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(b) F= BCD + ACD 


Figure 20.9 Overlapping Groups 


An example, presented in [HAYE98], illustrates the points we have been dis- 
cussing. We would like to develop the Boolean expressions for a circuit that adds 1 toa 
packed decimal digit. Recall from Section 9.2 that with packed decimal, each decimal 
digit is represented by a 4-bit code, in the obvious way. Thus, 0 = 0000, 1 = 0001, ..., 
8 = 1000, and 9 = 1001. The remaining 4-bit values, from 1010 to 1111, are not 
used. This code is also referred to as Binary Coded Decimal (BCD). 

Table 20.4 shows the truth table for producing a 4-bit result that is one more 
than a 4-bit BCD input. The addition is modulo 10. Thus, 9 + 1 = 0. Also, note that 
six of the input codes produce “don’t care” results, because those are not valid BCD 
inputs. Figure 20.10 shows the resulting Karnaugh maps for each of the output vari- 
ables. The d squares are used to achieve the best possible groupings. 


THE QUINE—MCKLUSKEY METHOD For more than four variables, the Karnaugh 
map method becomes increasingly cumbersome. With five variables, two 16 X 16 
maps are needed, with one map considered to be on top of the other in three di- 
mensions to achieve adjacency. Six variables require the use of four 16 X 16 tables 
in four dimensions! An alternative approach is a tabular technique, referred to as 
the Quine—McKluskey method. The method is suitable for programming on a com- 
puter to give an automatic tool for producing minimized Boolean expressions. 

The method is best explained by means of an example. Consider the following 
expression: 


ABCD + ABCD + ABC D + ABCD + ABCD + ABCD + ABCD + AB CD 


Let us assume that this expression was derived from a truth table. We would like to 
produce a minimal expression suitable for implementation with gates. 
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Table 20.4 Truth Table for the One-Digit Packed Decimal Incrementer 
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The first step is to construct a table in which each row corresponds to one of the 
product terms of the expression. The terms are grouped according to the number of 
complemented variables. That is, we start with the term with no complements, if it ex- 
ists, then all terms with one complement, and so on. Table 20.5 shows the list for our 
example expression, with horizontal lines used to indicate the grouping. For clarity, 
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Figure 20.10 Karnaugh Maps for the Incrementer 
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Table 20.5 First Stage of Quine-McKluskey Method 
(for F = ABCD + ABCD + ABCD + ABCD + ABCD + ABCD + ABCD + ABCD) 
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each term is represented by a 1 for each uncomplemented variable and a 0 for each 
complemented variable. Thus, we group terms according to the number of 1s they con- 
tain. The index column is simply the decimal equivalent and is useful in what follows. 
The next step is to find all pairs of terms that differ in only one variable, that is, 
all pairs of terms that are the same except that one variable is 0 in one of the terms 
and 1 in the other. Because of the way in which we have grouped the terms, we can 
do this by starting with the first group and comparing each term of the first group 
with every term of the second group. Then compare each term of the second group 
with all of the terms of the third group, and so on. Whenever a match is found, place 
a check next to each term, combine the pair by eliminating the variable that differs 
in the two terms, and add that to a new list. Thus, for example, the terms ABCD and 
ABCD are combined to produce ABC. This process continues until the entire origi- 
nal table has been examined. The result is a new table with the following entries: 


ACD ABC ABD, 


BCD, ACD 
ABC BCD, 
ABD, 


The new table is organized into groups, as indicated, in the same fashion as the 
first table. The second table is then processed in the same manner as the first. That is, 
terms that differ in only one variable are checked and a new term produced for a third 
table. In this example, the third table that is produced contains only one term: BD. 

In general, the process would proceed through successive tables until a table 
with no matches was produced. In this case, this has involved three tables. 

Once the process just described is completed, we have eliminated many of the 
possible terms of the expression. Those terms that have not been eliminated are used to 
construct a matrix, as illustrated in Table 20.6. Each row of the matrix corresponds to 
one of the terms that have not been eliminated (has no check) in any of the tables used 
so far. Each column corresponds to one of the terms in the original expression. An X is 
placed at each intersection of a row and a column such that the row element is “com- 
patible” with the column element. That is, the variables present in the row element 
have the same value as the variables present in the column element. Next, circle each X 
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Table 20.6 Last Stage of Quine-McKluskey Method 


(for F = ABCD + ABCD + ABCD + ABCD + ABCD + ABCD + ABCD + ABCD) 


that is alone in a column. Then place a square around each X in any row in which there 
is a circled X. If every column now has either a squared or a circled X, then we are 
done, and those row elements whose Xs have been marked constitute the minimal 
expression. Thus, in our example, the final expression is 


ABC + ACD + ABC + A CD 


In cases in which some columns have neither a circle nor a square, additional 
processing is required. Essentially, we keep adding row elements until all columns 
are covered. 

Let us summarize the Quine—McKluskey method to try to justify intuitively 
why it works. The first phase of the operation is reasonably straightforward. The 
process eliminates unneeded variables in product terms. Thus, the expression 
ABC + ABC is equivalent to AB, because 


ABC + ABC = AB(C + GC) = AB 


After the elimination of variables, we are left with an expression that is clearly 
equivalent to the original expression. However, there may be redundant terms in 
this expression, just as we found redundant groupings in Karnaugh maps. The matrix 
layout assures that each term in the original expression is covered and does so in a 
way that minimizes the number of terms in the final expression. 


NAND AND NOR IMPLEMENTATIONS Another consideration in the implementa- 
tion of Boolean functions concerns the types of gates used. It is sometimes desirable to 
implement a Boolean function solely with NAND gates or solely with NOR gates. Al- 
though this may not be the minimum-gate implementation, it has the advantage of reg- 
ularity, which can simplify the manufacturing process. Consider again Equation (20.3): 


F=BA+0C) 
Because the complement of the complement of a value is just the original value, 
F = B(A + ©) = (AB) + (BC) 
Applying DeMorgan’s theorem, 
F = (AB): (BC) 
which has three NAND forms, as illustrated in Figure 20.11. 
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Figure 20.11 NAND Implementation of 
Table 20.3 


Multiplexers 


The multiplexer connects multiple inputs to a single output. At any time, one of the 
inputs is selected to be passed to the output. A general block diagram representa- 
tion is shown in Figure 20.12. This represents a 4-to-1 multiplexer. There are four 
input lines, labeled DO, D1, D2, and D3. One of these lines is selected to provide the 
output signal F. To select one of the four possible inputs, a 2-bit selection code is 
needed, and this is implemented as two select lines labeled S1 and S2. 

An example 4-to-1 multiplexer is defined by the truth table in Table 20.7. This 
is a simplified form of a truth table. Instead of showing all possible combinations of 
input variables, it shows the output as data from line DO, D1, D2, or D3. Figure 
20.13 shows an implementation using AND, OR, and NOT gates. S1 and S2 are con- 
nected to the AND gates in such a way that, for any combination of S1 and S2, 
three of the AND gates will output 0. The fourth AND gate will output the value of 
the selected line, which is either 0 or 1. Thus, three of the inputs to the OR gate are 
always 0, and the output of the OR gate will equal the value of the selected input 
gate. Using this regular organization, it is easy to construct multiplexers of size 
8-to-1, 16-to-1, and so on. 

Multiplexers are used in digital circuits to control signal and data routing. An 
example is the loading of the program counter (PC). The value to be loaded into the 
program counter may come from one of several different sources: 


e A binary counter, if the PC is to be incremented for the next instruction 


DO 
mY 4-to-1 F 
D2 MUX 
D3 
$2 Sl 


Figure 20.12 4-to-1 Multiplexer 
Representation 
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Table 20.7 4-to-1 Multiplexer Truth Table 


e The instruction register, if a branch instruction using a direct address has just 
been executed 


e The output of the ALU, if the branch instruction specifies the address using a 
displacement mode 


These various inputs could be connected to the input lines of a multiplexer, with the 
PC connected to the output line. The select lines determine which value is loaded 
into the PC. Because the PC contains multiple bits, multiple multiplexers are used, 
one per bit. Figure 20.14 illustrates this for 16-bit addresses. 


Decoders 


A decoder is a combinational circuit with a number of output lines, only one of 
which is asserted at any time, dependent on the pattern of input lines. In general, a 


$2 Sl 


DO 


D1 


D2 


Figure 20.13 Multiplexer Implementation 
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Figure 20.14 Multiplexer Input to Program Counter 


decoder has n inputs and 2” outputs. Figure 20.15 shows a decoder with three inputs 
and eight outputs. 

Decoders find many uses in digital computers. One example is address decod- 
ing. Suppose we wish to construct a 1K-byte memory using four 256 xX 8—bit RAM 
chips. We want a single unified address space, which can be broken down as follows: 


Address Chip 
0000-00FF 0 
0100-01 FF 1 
0200-02FF 2 
0300-03FF 3 


Figure 20.15 Decoder with 3 Inputs and 2° = 8 Outputs 
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Figure 20.16 Address Decoding 


Each chip requires 8 address lines, and these are supplied by the lower-order 8 
bits of the address. The higher-order 2 bits of the 10-bit address are used to select 
one of the four RAM chips. For this purpose, a 2-to-4 decoder is used whose output 
enables one of the four chips, as shown in Figure 20.16. 

With an additional input line, a decoder can be used as a demultiplexer. The 
demultiplexer performs the inverse function of a multiplexer; it connects a single 
input to one of several outputs. This is shown in Figure 20.17. As before, m inputs are 
decoded to produce a single one of 2” outputs. All of the 2” output lines are ANDed 
with a data input line. Thus, the n inputs act as an address to select a particular out- 
put line, and the value on the data input line (0 or 1) is routed to that output line. 

The configuration in Figure 20.17 can be viewed in another way. Change the 
label on the new line from Data Input to Enable. This allows for the control of the 
timing of the decoder. The decoded output appears only when the encoded input is 
present and the enable line has a value of 1. 


Read-Only Memory 


Combinational circuits are often referred to as “memoryless” circuits, because their 
output depends only on their current input and no history of prior inputs is retained. 


e e 
. e° e 
N-bit 
re ° e 
destination . : - 
address x . 2” outputs 
° N-to-2" decoder 
e 
e 
e 
Data input 


Figure 20.17 Implementation of a Demultiplexer Using 
a Decorder 
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Table 20.8 Truth Table fora ROM 
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However, there is one sort of memory that is implemented with combinational cir- 
cuits, namely read-only memory (ROM). 

Recall that a ROM is a memory unit that performs only the read operation. 
This implies that the binary information stored in a ROM is permanent and was cre- 
ated during the fabrication process. Thus, a given input to the ROM (address lines) 
always produces the same output (data lines). Because the outputs are a function 
only of the present inputs, the ROM is in fact a combinational circuit. 

A ROM can be implemented with a decoder and a set of OR gates. As an ex- 
ample, consider Table 20.8. This can be viewed as a truth table with four inputs and 
four outputs. For each of the 16 possible input values, the corresponding set of val- 
ues of the outputs is shown. It can also be viewed as defining the contents of a 64-bit 
ROM consisting of 16 words of 4 bits each. The four inputs specify an address, and 
the four outputs specify the contents of the location specified by the address. Figure 
20.18 shows how this memory could be implemented using a 4-to-16 decoder and 
four OR gates. As with the PLA, a regular organization is used, and the intercon- 
nections are made to reflect the desired result. 


Adders 


So far, we have seen how interconnected gates can be used to implement such func- 
tions as the routing of signals, decoding, and ROM. One essential area not yet 
addressed is that of arithmetic. In this brief overview, we will content ourselves with 
looking at the addition function. 
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Figure 20.18 A 64-Bit ROM 
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Binary addition differs from Boolean algebra in that the result includes a carry 


term. Thus, 
0 
+ 0 +1 + 0 
0 1 1 


However, addition can still be dealt with in Boolean terms. In Table 20.9a, we show the 
logic for adding two input bits to produce a 1-bit sum and a carry bit. This truth table 


Table 20.9 Binary Addition Truth Tables 


(a) Single-Bit Addition 


(b) Addition with Carry Input 


A 


B 


Sum 


0 0 
0 0 
0 il 
0 il 
il 0 
1 0 
il 1 
il il 
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Figure 20.19 4-Bit Adder 


could easily be implemented in digital logic. However, we are not interested in per- 
forming addition on just a single pair of bits. Rather, we wish to add two n-bit num- 
bers. This can be done by putting together a set of adders so that the carry from one 
adder is provided as input to the next. A 4-bit adder is depicted in Figure 20.19. 

For a multiple-bit adder to work, each of the single-bit adders must have three 
inputs, including the carry from the next-lower-order adder. The revised truth table 
appears in Table 20.9b. The two outputs can be expressed: 


Sum = A BC + ABC + ABC + ABC 


Carry = AB + AC + BC 


Figure 20.20 is an implementation using AND, OR, and NOT gates. 


Sum 


Carry 


Figure 20.20 Implementation of an Adder 
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Figure 20.21 Construction of a 32-Bit Adder Using 8-Bit Adders 


Thus we have the necessary logic to implement a multiple-bit adder such as 
shown in Figure 20.21. Note that because the output from each adder depends on 
the carry from the previous adder, there is an increasing delay from the least significant 
to the most significant bit. Each single-bit adder experiences a certain amount of 
gate delay, and this gate delay accumulates. For larger adders, the accumulated delay 
can become unacceptably high. 

If the carry values could be determined without having to ripple through all 
the previous stages, then each single-bit adder could function independently, and 
delay would not accumulate. This can be achieved with an approach known as carry 
lookahead. Let us look again at the 4-bit adder to explain this approach. 

We would like to come up with an expression that specifies the carry input to 
any stage of the adder without reference to previous carry values. We have 


Cy = AoBo (20.4) 
C, = A,B, + A,ApBy + ByAoBo (20.5) 
Following the same procedure, we get 


C, = A,B, + A,A,B, + A,A,;ApBy + A,B, ApBy + B,A,B, 
+ B,A,A,By + B,B,ApBo 


This process can be repeated for arbitrarily long adders. Each carry term can be ex- 
pressed in SOP form as a function only of the original inputs, with no dependence 
on the carries. Thus, only two levels of gate delay occur regardless of the length of 
the adder. 

For long numbers, this approach becomes excessively complicated. Evaluating 
the expression for the most significant bit of an n-bit adder requires an OR gate 
with n — 1 inputs and n AND gates with from 2 to n + 1 inputs. Accordingly, full 
carry lookahead is typically done only 4 to 8 bits at a time. Figure 20.21 shows how a 
32-bit adder can be constructed out of four 8-bit adders. In this case, the carry must 
ripple through the four 8-bit adders, but this will be substantially quicker than a 
ripple through thirty-two 1-bit adders. 


20.4 SEQUENTIAL CIRCUITS 


Combinational circuits implement the essential functions of a digital computer. 
However, except for the special case of ROM, they provide no memory or state in- 
formation, elements also essential to the operation of a digital computer. For the 
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latter purposes, a more complex form of digital logic circuit is used: the sequential 
circuit. The current output of a sequential circuit depends not only on the current 
input, but also on the past history of inputs. Another and generally more useful way 
to view it is that the current output of a sequential circuit depends on the current 
input and the current state of that circuit. 

In this section, we examine some simple but useful examples of sequential cir- 
cuits. As will be seen, the sequential circuit makes use of combinational circuits. 


Flip-Flops 


The simplest form of sequential circuit is the flip-flop. There are a variety of flip- 
flops, all of which share two properties: 


e The flip-flop is a bistable device. It exists in one of two states and, in the absence 
of input, remains in that state. Thus, the flip-flop can function as a 1-bit memory. 


e The flip-flop has two outputs, which are always the complements of each other. 
These are generally labeled Q and Q. 


THE S—R LATCH Figure 20.22 shows a common configuration known as the S-R 
flip-flop or S-R latch. The circuit has two inputs, S (Set) and R (Reset), and two out- 
puts, Q and Q, and consists of two NOR gates connected in a feedback arrangement. 

First, let us show that the circuit is bistable. Assume that both S and R are 0 
and that Q is 0. The inputs to the lower NOR gate are Q = 0 and S = O. Thus, the 
output QO = 1 means that the inputs to the upper NOR gate are Q = 1 andR = 0, 
which has the output Q = 0. Thus, the state of the circuit is internally consistent and 
remains stable as long as S = R = 0. A similar line of reasoning shows that the state 
QO = 1,0 = Oisalso stable for R = S = 0. 

Thus, this circuit can function as a 1-bit memory. We can view the output Q as 
the “value” of the bit. The inputs S and R serve to write the values 1 and 0, respec- 
tively, into memory. To see this, consider the state Q = 0,O = 1,S = 0,R = 0. Sup- 
pose that S changes to the value 1. Now the inputs to the lower NOR gate are S = 1, 
Q = 0. After some time delay At, the output of the lower NOR gate will be Q = 0 


fe) 


S 


Figure 20.22 The S—R Latch Implemented 
with NOR Gates 
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Figure 20.23 NOR S-R Latch timing Diagram 


(see Figure 20.23). So, at this point in time, the inputs to the upper NOR gate be- 
come R = 0,Q = 0. After another gate delay of At, the output Q becomes 1.This is 
again a stable state. The inputs to the lower gate are now S = 1, Q = 1, which main- 
tain the output Q = 0. As long as S = 1 and R = 0, the outputs will remain Q = 1, 
Q = 0. Furthermore, if S returns to 0, the outputs will remain unchanged. 

The R output performs the opposite function. When R goes to 1, it forces 
Q = 0, OQ = 1 regardless of the previous state of Q and Q. Again, a time delay of 
2At occurs before the final state is established (Figure 20.23). 

The S-R latch can be defined with a table similar to a truth table, called a 
characteristic table, which shows the next state or states of a sequential circuit as a func- 
tion of current states and inputs. In the case of the S—R latch, the state can be defined 
by the value of Q. Table 20.10a shows the resulting characteristic table. Observe that 
the inputs S = 1, R = 1 are not allowed, because these would produce an inconsistent 
output (both Q and Q equal 0). The table can be expressed more compactly, as in Table 
20.10b. An illustration of the behavior of the S—R latch is shown in Table 20.10c. 


CLOCKED S-—R FLIP-FLOP The output of the S—R latch changes, after a brief time 
delay, in response to a change in the input. This is referred to as asynchronous oper- 
ation. More typically, events in the digital computer are synchronized to a clock 
pulse, so that changes occur only when a clock pulse occurs. Figure 20.24 shows this 
arrangement. This device is referred to as a clocked S-R flip-flop. Note that the R 
and S inputs are passed to the NOR gates only during the clock pulse. 


D FLIP-FLOP One problem with S-R flip-flop is that the condition R = 1,S = 1 
must be avoided. One way to do this is to allow just a single input. The D flip-flop 
accomplishes this. Figure 20.25 shows a gate implementation and the characteristic 
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Table 20.10 The S-R Latch 


(a) Characteristic Table (b) Simplified Characteristic Table 


Current Current Next R Qn44 
Inputs State State 


0 
1 
0 
1 
0 
1 
0 
1 


(c) Response to Series of Inputs 


3 
0 
1 
0 


Clock 


el 


S 


Figure 20.24 Clocked S-R Flip Flop 


D 
Figure 20.25  D Flip Flop 
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Figure 20.26 J-K Flip Flop 


table of the D flip-flop. By using an inverter, the nonclock inputs to the two AND 
gates are guaranteed to be the opposite of each other. 

The D flip-flop is sometimes referred to as the data flip-flop because it is, in 
effect, storage for one bit of data. The output of the D flip-flop is always equal to the 
most recent value applied to the input. Hence, it remembers and produces the last 
input. It is also referred to as the delay flip-flop, because it delays a 0 or 1 applied to 
its input for a single clock pulse. We can capture the logic of the D flip-flop in the 
following truth table: 


D Qiu+1 
0 0 
1 1 


J-K Fuip-FLop Another useful flip-flop is the J-K flip-flop. Like the S—R flip-flop, 
it has two inputs. However, in this case all possible combinations of input values are 
valid. Figure 20.26 shows a gate implementation of the J—K flip-flop, and Figure 
20.27 shows its characteristic table (along with those for the S-R and D flip-flops). 
Note that the first three combinations are the same as for the S-R flip-flop. With no 
input asserted, the output is stable. If only the J input is asserted, the result is a set 
function, causing the output to be 1; if only the K in put is asserted, the result is a 
reset function, causing the output to be 0. When both J and K are 1, the function per- 
formed is referred to as the toggle function: the output is reversed. Thus, if Q is 1 and 
1 is applied to J and K, then Q becomes 0. The reader should verify that the imple- 
mentation of Figure 20.26 produces this characteristic function. 


Registers 


As an example of the use of flip-flops, let us first examine one of the essential ele- 
ments of the CPU: the register. As we know, a register is a digital circuit used within 
the CPU to store one or more bits of data. Two basic types of registers are common- 
ly used: parallel registers and shift registers. 


PARALLEL REGISTERS A parallel register consists of a set of 1-bit memories that 
can be read or written simultaneously. It is used to store data. The registers that we 
have discussed throughout this book are parallel registers. 
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Figure 20.27 Basic Flip-Flops 


The 8-bit register of Figure 20.28 illustrates the operation of a parallel 
register using D flip-flops. A control signal, labeled load, controls writing into 
the register from signal lines, D11 through D18. These lines might be the output 
of multiplexers, so that data from a variety of sources can be loaded into the 


register. 
Data lines 
D18 D17 D16 D15 D14 D13 D12 Dil 

D Q DQ D Q DQ D Q DQ » @ D Q 

Clk }> Clk Clk }> Clk }> Clk }> Clk }> Clk {> Clk 
Clock + ‘N 
Load + y 

D08 DO7 D06 DOS D04 D03 D02 DOL 


Output lines 
Figure 20.28 8-Bit Parallel Register 
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Clock 
Figure 20.29 5-Bit Shift Register 


SHIFT REGISTER A shift register accepts and/or transfers information serially. Con- 
sider, for example, Figure 20.29, which shows a 5-bit shift register constructed from 
clocked D flip-flops. Data are input only to the leftmost flip-flop. With each clock pulse, 
data are shifted to the right one position, and the rightmost bit is transferred out. 

Shift registers can be used to interface to serial I/O devices. In addition, they can 
be used within the ALU to perform logical shift and rotate functions. In this latter ca- 
pacity, they need to be equipped with parallel read/write circuitry as well as serial. 


Counters 


Another useful category of sequential circuit is the counter. A counter is a register 
whose value is easily incremented by 1 modulo the capacity of the register; that is, 
after the maximum value is achieved the next increment sets the counter value to 0. 
Thus, a register made up of n flip-flops can count up to 2” — 1. An example of a 
counter in the CPU is the program counter. 

Counters can be designated as asynchronous or synchronous, depending on 
the way in which they operate. Asynchronous counters are relatively slow because 
the output of one flip-flop triggers a change in the status of the next flip-flop. In a 
synchronous counter, all of the flip-flops change state at the same time. Because the 
latter type is much faster, it is the kind used in CPUs. However, it is useful to begin 
the discussion with a description of an asynchronous counter. 


RIPPLE COUNTER An asynchronous counter is also referred to as a ripple counter, 
because the change that occurs to increment the counter starts at one end and “rip- 
ples” through to the other end. Figure 20.30 shows an implementation of a 4-bit 
counter using J—-K flip-flops, together with a timing diagram that illustrates its be- 
havior. The timing diagram is idealized in that it does not show the propagation 
delay that occurs as the signals move down the series of flip-flops. The output of the 
leftmost flip-flop (Qo) is the least significant bit. The design could clearly be extended 
to an arbitrary number of bits by cascading more flip-flops. 

In the illustrated implementation, the counter is incremented with each clock 
pulse. The J and K inputs to each flip-flop are held at a constant 1. This means that, 
when there is a clock pulse, the output at Q will be inverted (1 to 0; 0 to 1). Note that 
the change in state is shown as occurring with the falling edge of the clock pulse; this 
is known as an edge-triggered flip-flop. Using flip-flops that respond to the transi- 
tion in a clock pulse rather than the pulse itself provides better timing control in 
complex circuits. If one looks at patterns of output for this counter, it can be seen 
that it cycles through 0000, 0001, ..., 1110, 1111, 0000, and so on. 
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Figure 20.30 Ripple Counter 


SYNCHRONOUS COUNTERS The ripple counter has the disadvantage of the delay 
involved in changing value, which is proportional to the length of the counter. To 
overcome this disadvantage, CPUs make use of synchronous counters, in which all 
of the flip-flops of the counter change at the same time. In this subsection, we pre- 
sent a design for a 3-bit synchronous counter. In doing so, we illustrate some basic 
concepts in the design of a synchronous circuit. 

For a 3-bit counter, three flip-flops will be needed. Let us use J—K flip-flops. 
Label the uncomplemented output of the three flip-flops A, B, C respectively, with C 
representing the least significant bit. The first step is to construct a truth table that 
relates the J-K inputs and outputs, to allow us to design the overall circuit. Such a 
truth table is shown in Figure 20.31a.The first three columns show the possible com- 
binations of outputs A, B, and C. They are listed in the order that they will appear as 
the counter is incremented. Each row lists the current value of A, B, C and the inputs 
to the three flip-flops that will be required to reach the next value of A, B, C. 

To understand the way in which the truth table of Figure 20.31a is constructed, 
it may be helpful to recast the characteristic table for the J—K flip-flop. Recall that 
this table was presented as follows: 


J K Q, +1 
0 0 Q, 
0 1 0 

1 0 a 

1 1 Qnet 
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(c) Logic diagram 
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Figure 20.31 Design of a Synchronous Counter 


In this form, the table shows the effect that the J and K inputs have on the output. 
Now consider the following organization of the same information: 


Q, J K Quit 
0 0 d 0 
0 1 d 1 
1 d 1 0 
1 d 0 1 
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In this form, the table provides the value of the next output when the inputs and the 
present output are known. This is exactly the information needed to design the 
counter or, indeed, any sequential circuit. In this form, the table is referred to as an 
excitation table. 

Let us return to Figure 20.31a. Consider the first row. We want the value of A 
to remain 0, the value of B to remain 0, and the value of C to go from 0 to 1 with 
the next application of a clock pulse. The excitation table shows that to maintain 
an output of 0, we must have inputs of J = 0 and don’t care for K. To effect a tran- 
sition from 0 to 1, the inputs must be J = 1 and K = d. These values are shown in 
the first row of the table. By similar reasoning, the remainder of the table can be 
filled in. 

Having constructed the truth table of Figure 20.31a, we see that the table 
shows the required values of all of the J and K inputs as functions of the current 
values of A, B, and C. With the aid of Karnaugh maps, we can develop Boolean ex- 
pressions for these six functions. This is shown in part b of the figure. For example, 
the Karnaugh map for the variable Ja (the J input to the flip-flop that produces the 
A output) yields the expression Ja = BC. When all six expressions are derived, 
it is a straightforward matter to design the actual circuit, as shown in part c of 
the figure. 
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Thus far, we have treated individual gates as building blocks, from which arbitrary 
functions can be realized. The designer could pursue a strategy of minimizing the 
number of gates to be used by manipulating the corresponding Boolean expressions. 

As the level of integration provided by integrated circuits increases, other con- 
siderations apply. Early integrated circuits, using small-scale integration (SSI), pro- 
vided from one to ten gates on a chip. Each gate is treated independently, in the 
building-block approach described so far. Figure 20.32 is an example of some SSI 
chips. To construct a logic function, a number of these chips are laid out on a printed 
circuit board and the appropriate pin interconnections are made. 

Increasing levels of integration made it possible to put more gates on a chip 
and to make gate interconnections on the chip as well. This yields the advantages of 
decreased cost, decreased size, and increased speed (because on-chip delays are of 
shorter duration than off-chip delays). A design problem arises, however. For each 
particular logic function or set of functions, the layout of gates and interconnections 
on the chip must be designed. The cost and time involved in such custom chip design 
is high. Thus, it becomes attractive to develop a general-purpose chip that can be 
readily adapted to specific purposes. This is the intent of the programmable logic 
device (PLD). 

There are a number of different types of PLDs in commercial use. Table 20.11 
lists some of the key terms and defines some of the most important types. In this sec- 
tion, we first look at one of the simplest such devices, the programmable logic array 
(PLA) and then introduce perhaps the most important and widely used type of 
PLD, the field-programmable gate array (FPGA). 
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Figure 20.32 Some SSI Chips. Pin layouts from The TTL Data Book for Design Engineers, 
copyright © 1976 Texas Instrument Incorporated. 
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Table 20.11 PLD Terminology 


Programmable Logic Device (PLD) 

A general term that refers to any type of integrated circuit used for implementing digital hardware, 
where the chip can be configured by the end user to realize different designs. Programming of such a 
device often involves placing the chip into a special programming unit, but some chips can also be 
configured “in-system.” Also referred to as a field-programmable device (FPD). 

Programmable Logic Array (PLA) 

A relatively small PLD that contains two levels of logic, an AND-plane and an OR-plane, where 
both levels are programmable. 

Programmable Array Logic (PAL) 

A relatively small PLD that has a programmable AND-plane followed by a fixed OR-plane. 


Simple PLD (SPLD) 
A PLA or PAL. 


Complex PLD (CPLD) 


A more complex PLD that consists of an arrangement of multiple SPLD-like blocks on a single 
chip. 


Field-Programmable Gate Array (FPGA) 


A PLD featuring a general structure that allows very high logic capacity. Whereas CPLDs feature 
logic resources with a wide number of inputs (AND planes), FPGAs offer more narrow logic 
resources. FPGAs also offer a higher ratio of flip-flops to logic resources than do CPLDs. 


Logic Block 


A relatively small circuit block that is replicated in an array in an FPD. When a circuit is implemented 
in an FPD, it is first decomposed into smaller sub-circuits that can each be mapped into a logic 
block. The term logic block is mostly used in the context of FPGAs, but it could also refer to a block 
of circuitry ina CPLD. 


Programmable Logic Array 


The PLA is based on the fact that any Boolean function (truth table) can be ex- 
pressed in a sum-of-products (SOP) form, as we have seen. The PLA consists of a 
regular arrangement of NOT, AND, and OR gates on a chip. Each chip input is 
passed through a NOT gate so that each input and its complement are available to 
each AND gate. The output of each AND gate is available to each OR gate, and the 
output of each OR gate is a chip output. By making the appropriate connections, ar- 
bitrary SOP expressions can be implemented. 

Figure 20.33a shows a PLA with three inputs, eight gates, and two outputs. On 
the left is a programmable AND array. The AND array is programmed by establish- 
ing a connection between any PLA input or its negation and any AND gate input by 
connecting the corresponding lines at their point of intersection. On the right is a 
programmable OR array, which involves connecting AND gate outputs to OR gate 
inputs. Most larger PLAs contain several hundred gates, 15 to 25 inputs, and 5 to 15 
outputs. The connections from the inputs to the AND gates, and from the AND 
gates to the OR gates, are not specified until programming time. 

PLAs are manufactured in two different ways to allow easy programming (mak- 
ing of connections). In the first, every possible connection is made through a fuse at 


CHAPTER 20 / DIGITAL LOGIC 


20-36 


(a) Layout for 3-input 2-output PLA 


AB + AC 


ABC + AB 


(b) Programmed PLA 


Figure 20.33 An Example of a Programmable Logic Array 
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every intersection point. The undesired connections can then be later removed by 
blowing the fuses. This type of PLA is referred to as a field-programmable logic array. 
Alternatively, the proper connections can be made during chip fabrication by using an 
appropriate mask supplied for a particular interconnection pattern. In either case, the 
PLA provides a flexible, inexpensive way of implementing digital logic functions. 
Figure 20.33b shows a programmed PLA that realizes two Boolean expressions. 


Field-Programmable Gate Array 


The PLA is an example of a simple PLD (SPLD). The difficulty with increasing ca- 
pacity of a strict SPLD architecture is that the structure of the programmable logic- 
planes grows too quickly in size as the number of inputs is increased. The only 
feasible way to provide large capacity devices based on SPLD architectures is then 
to integrate multiple SPLDs onto a single chip and provide interconnect to pro- 
grammably connect the SPLD blocks together. Many commercial PLD products 
exist on the market today with this basic structure, and are collectively referred to as 
Complex PLDs (CPLDs). The most important type of CPLD is the FPGA. 

An FPGA consists of an array of uncommitted circuit elements, called logic 
blocks, and interconnect resources. An illustration of a typical FPGA architecture is 
shown in Figure 20.34. The key components of an FGPA are; 


e Logic block: The configurable logic blocks are where the computation of the 
user’s circuit takes place. 


e I/O block: The I/O blocks connect I/O pins to the circuitry on the chip. 
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Figure 20.34 Structure of an FPGA 
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Figure 20.35 A Simple FPGA Logic Block 


e Interconnect: These are signal paths available for establishing connections 
among I/O blocks and logic blocks. 


The logic block can be either a combinational circuit or a sequential circuit. In 
essence, the programming of a logic block is done by downloading the contents of a 
truth table for a logic function. Figure 20.35 Shows an example of a simple logic 
block consisting of a D flip-flop, a 2-to-1 multiplexer, and a 16-bit lookup table. The 
lookup table is a memory consisting of 16 1-bit elements, so that 4 input lines are re- 
quired to select one of the 16 bits. Larger logic blocks have larger lookup tables and 
multiple interconnected lookup tables. The combinational logic realized by the 
lookup table can be output directly or stored in the D flip-flop and output synchro- 
nously. A separate one-bit memory controls the multiplexer to determine whether 
the output comes directly from the lookup table or from the flip-flop. 

By interconnecting numerous logic blocks, very complex logic functions can be 
easily implemented. 


20.6 RECOMMENDED READING AND WEB SITE 


[GREG98] is an easy-to-read introduction to the concepts of this chapter. [STON96] is an 
excellent short introduction. A number of textbooks provide more in-depth treatment; these 
include [MANO04] and [FARH04]. 

[BROW96] is a worthwhile tutorial on programmable logic devices. [LEON08] looks at 
recent developments in FPGA devices, platforms, and applications. 


BROW96_ Brown, S., and Rose, S. “Architecture of FPGAs and CPLDs: A Tutorial.” 
IEEE Design and Test of Computers, Vol. 13, No. 2, 1996. 

FARH04 Farhat, H. Digital Design and Computer Organization. Boca Ratan: CRC 
Press, 2004. 

GREG98_ Gregg, J. Ones and Zeros: Understanding Boolean Algebra, Digital Circuits, 
and the Logic of Sets. New York: Wiley, 1998. 


LEONO8 Leong, p. “Recent Trends in FPGA Architectures and Applications.” Proceedings, 
4" IEEE International symposium on Electronic Design, Test, and Applications, 2008. 

MANO04 Mano, M., and Kime, C. Logic and Computer Design Fundamentals. Upper 
Saddle River, NJ: Prentice Hall, 2004. 


STON96 Stonham, T. Digital Logic Techniques. London: Chapman & Hall, 1996. 
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Recommended Web site: 


e Digital Logic: Useful collection of circuit diagrams, with interactive features. 


20.7 KEY TERMS AND PROBLEMS 


Key Terms 


adder 

AND gate 

assert 

Boolean algebra 

clocked S—R flip-flop 

combinational circuit 

complex PLD (CPLD) 

counter 

decoder 

D flip-flop 

excitation table 

field-programmable gate array 
(FPGA) 

flip-flop 

gates 


graphical symbol 

J-K flip-flop 

Karnaugh map 

logic block 

lookup table 

multiplexer 

NAND gate 

NOR 

OR gate 

parallel register 

product of sums (POS) 

programmable array logic 
(PAL) 

programmable logic array 
(PLA) 


programmable logic device 
(PLD) 

Quine-McKluskey method 

read-only memory (ROM) 

register 

ripple counter 

sequential circuit 

shift register 

simple PLD (SPLD) 

sum of products (SOP) 

synchronous counter 

S-R Latch 

truth table 

XOR gate 


Problems 

20.1 Construct a truth table for the following Boolean expressions: 
a ABC+ABC |. 
b. ABC + ABC + ABC 


c. A(BC + BC) 
d. (A + B)(A + C)(A + B) 
20.2 Simplify the following expressions according to the commutative law: 
a. A‘B+B:-A+C:D-E+C:D-E+E:C:D 
b A‘B+A:C+B:A 
c. (L:M-N)(A:B)(C:D-E)(M-N-L) 
d. F-(K+R)+S-V+WwW-X+V-°S+ X:W+(R+K):F 
20.3. Apply DeMorgan’s theorem to the following equations: 
a F=V+A4+L 
b F=A+B+C+D 
20.4 Simplify the following expressions: 


a, 
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20.5 
20.6 


20.7 
20.8 


20.9 
20.10 
20.11 
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Combinational 
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digit | %3 
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(b) 
Figure 20.36 Seven-Segment LED Display Example 


f. A=Y:(W+X+Y+Z)-Z 

g A=(B:-E+C+F):C 

Construct the operation XOR from the basic Boolean operations AND, OR and NOT. 
Given a NOR gate and NOT gates, draw a logic diagram that will perform the three- 
input AND function. 

Write the Boolean expression for a four-input NAND gate. 

A combinational circuit is used to control a seven-segment display of decimal digits, 
as shown in Figure 20.36. The circuit has four inputs, which provide the four-bit code 
used in packed decimal representation (0;) = 0000,...,9;5 = 1001). The seven out- 
puts define which segments will be activated to display a given decimal digit. Note 
that some combinations of inputs and outputs are not needed. 

a. Develop a truth table for this circuit. 

b. Express the truth table in SOP form. 

c. Express the truth table in POS form. 

d. Provide a simplified expression. 

Design an 8-to-1 multiplexer. 

Add an additional line to Figure 20.15 so that it functions as a demultiplexer. 


The Gray code is a binary code for integers. It differs from the ordinary binary represen- 
tation in that there is just a single bit change between the representations of any two num- 
bers. This is useful for applications such as counters or analog-to-digital converters where 
a sequence of numbers is generated. Because only one bit changes at a time, there is never 
any ambiguity due to slight timing differences. The first eight elements of the code are 


Binary Code Gray Code 
000 000 
001 001 
010 011 
011 010 
100 110 
101 111 
110 101 
111 100 


Design a circuit that converts from binary to Gray code. 


20.12 


20.13 


20.14 


20.15 


20.16 


20.17 


20.18 
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Design a 5 X 32 decoder using four 3 X 8 decoders (with enable inputs) and one 
2 x 4 decoder. 

Implement the full adder of Figure 20.20 with just five gates. (Hint: Some of the gates 
are XOR gates.) 

Consider Figure 20.20. Assume that each gate produces a delay of 10 ns. Thus, the sum 
output is valid after 30 ns and the carry output after 0 ns. What is the total add time for 
a 32-bit adder 

a. Implemented without carry lookahead, as in Figure 20.19? 

b. Implemented with carry lookahead and using 8-bit adders, as in Figure 20.21? 

An alternative form of the S—R latch has the same structure as Figure 20.22 but uses 
NAND gates instead of NOR gates. 

a. Redo Table 20.10a and 20.10b for S—R latch implemented with NAND gates. 

b. Complete the following table, similar to Table 20.10c 


t 0 1 2 3 4 5 6 D 8 9 
S 0 1 1 1 1 1 0 1 0 1 
R 1 1 0 1 0 1 1 at 0 0 


Consider the graphic symbol for the S—-R flip-flop in Figure 20.27. Add additional 
lines to depict a D flip-flop wired from the S-R flip flop. 

Show the structure of a PLA with three inputs (C, B, A) and four outputs 
(Op, O;, On, O3), with the outputs defined as follows: 


O, = ABC + AB + ABC 
O, = ABC + ABC 
O,=C 

O, = AB + ABC 


An interesting application of a PLA is conversion from the old, obsolete punched 
cards character codes to ASCII codes. The standard punched cards that were so pop- 
ular with computers in the past had 12 rows and 80 columns where holes could be 
punched. Each column corresponded to one character, so each character had a 12-bit 
code. However, only 96 characters were actually used. Consider an application that 
reads punched cards and converts the character codes to ASCII. 

a. Describe a PLA implementation of this application. 

b. Can this problem be solved with a ROM? Explain. 
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KEY POINTS 


® The IA-64 instruction set architecture is a new approach to providing 
hardware support for instruction-level parallelism and is significantly 
different than the approach taken in superscalar architectures. 


® The most noteworthy features of the I[A-64 architecture are hardware 
support for predicated execution, control speculation, data speculation, 
and software pipelining. 


® With predicated execution, every [A-64 instruction includes a refer- 
ence to a 1-bit predicate register, and only executes if the predicate 
value is 1 (true). This enables the processor to speculatively execute 
both branches of an if statement and only commit after the condition 
is determined. 


® With control speculation, a load instruction is moved earlier in the pro- 
gram and its original position replaced by a check instruction. The 
early load saves cycle time; if the load produces an exception, the ex- 
ception is not activated until the check instruction determines if the 
load should have been taken. 


® With data speculation, a load is moved before a store instruction that 
might alter the memory location that is the source of the load. A subse- 
quent check is made to assure that the load receives the proper memory 
value. 


® Software pipelining is a technique in which instructions from multiple 
iterations of a loop are enabled to execute in parallel. 


With the Pentium 4, the microprocessor family that began with the 8086 and that 
has been the most successful computer product line ever appears to have come 
to an end. Intel has teamed up with Hewlett-Packard (HP) to develop a new 
64-bit architecture, called [A-64. IA-64 is not a 64-bit extension of Intel’s 32-bit 
x86 architecture, nor is it an adaptation of Hewlett-Packard’s 64-bit PA-RISC 
architecture. Instead, I[A-64 is a new architecture that builds on years of research 
at the two companies and at universities. The architecture exploits the vast cir- 
cuitry and high speeds available on the newest generations of microchips by a 
systematic use of parallelism. [A-64 architecture represents a significant depar- 
ture from the trend to superscalar schemes that have dominated recent proces- 
sor development. 

We begin this chapter with a discussion of the motivating factors for the new 
architecture. Next, we look at the general organization to support the architecture. We 
then examine in some detail the key features of the IA-64 architecture that promote 
instruction-level parallelism. Finally, we look at the [A-64 instruction set architecture 
and the Itanium organization. 
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21.1 MOTIVATION 


The basic concepts underlying IA-64 are: 


e Instruction-level parallelism that is explicit in the machine instructions rather 
than being determined at run time by the processor 


e Long or very long instruction words (LIW/VLIW) 
e Branch predication (not the same thing as branch prediction) 


e Speculative loading 


Intel and HP refer to this combination of concepts as explicitly parallel 
instruction computing (EPIC). Intel and HP use the term EPIC to refer to the tech- 
nology, or collection of techniques. [A-64 is an actual instruction set architecture 
that is intended for implementation using the EPIC technology. The first Intel prod- 
uct based on this architecture is referred to as Itanium. Other products will follow, 
based on the same JA-64 architecture. 

Table 21.1 summarizes key differences between IA-64 and a traditional super- 
scalar approach. 

For Intel, the move to a new architecture that is not hardware compatible with the 
x86 instruction architecture was a momentous decision. But it was driven by the dictates 
of the technology. When the x86 family began, back in the late 1970s, the processor chip 
had tens of thousands of transistors and was an essentially scalar device. That is, in- 
structions were processed one at a time, with little or no pipelining. As the number of 
transistors increased into the hundreds of thousands in the mid-1980s, Intel introduced 
pipelining (e.g., Figure 12.19). Meanwhile, other manufacturers were attempting to 
take advantage of the increased transistor count and increased speed by means of the 
RISC approach, which enabled more effective pipelining, and later the superscalar/ 
RISC combination, which involved multiple execution units. With the Pentium, Intel 
made a modest attempt to use superscalar techniques, allowing two CISC instructions 
to execute at a time. Then, the Pentium Pro and Pentium II through Pentium 4 incor- 
porated a mapping from CISC instructions to RISC-like micro-operations and the 


Table 21.1 Traditional Superscalar versus [A-64 Architecture 


Superscalar IA-64 


RISC-like instructions, one per word RISC-like instructions bundled into groups 
of three 


Multiple parallel execution units Multiple parallel execution units 


Reorders and optimizes instruction stream at Reorders and optimizes instruction stream 
run time at compile time 


Branch prediction with speculative execution Speculative execution along both paths of 
of one path a branch 


Loads data from memory only when needed, Speculatively loads data before its needed, and 
and tries to find the data in the caches first still tries to find data in the caches first 
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more aggressive use of superscalar techniques. This approach enabled the effective use 
of a chip with millions of transistors. But for the next generation processor, the one be- 
yond Pentium, Intel and other manufacturers are faced with the need to use effectively 
tens of millions of transistors on a single processor chip. 

Processor designers have few choices in how to use this glut of transistors. One 
approach is to dump those extra transistors into bigger on-chip caches. Bigger 
caches can improve performance to a degree but eventually reach a point of dimin- 
ishing returns, in which larger caches result in tiny improvements in hit rates. An- 
other approach is to provide for multiple processors on a single chip. This approach 
is discussed in Chapters 2 and 16. Yet another alternative is to increase the degree of 
superscaling by adding more execution units. The problem with this approach is that 
designers are, in effect, hitting a complexity wall. As more and more execution units 
are added, making the processor “wider,” more logic is needed to orchestrate these 
units. Branch prediction must be improved, out-of-order processing must be used, 
and longer pipelines must be employed. But with more and longer pipelines, there is 
a greater penalty for misprediction. Out-of-order execution requires a large number 
of renaming registers and complex interlock circuitry to account for dependencies. 
As a result, today’s best processors can manage at most to retire six instructions per 
cycle, and usually less. 

To address these problems, Intel and HP have come up with an overall design 
approach that enables the effective use of a processor with many parallel execution 
units. The heart of this new approach is the concept of explicit parallelism. With this 
approach, the compiler statically schedules the instructions at compile time, rather 
than having the processor dynamically schedule them at run time. The compiler de- 
termines which instructions can execute in parallel and includes this information 
with the machine instruction. The processor uses this information to perform paral- 
lel execution. One advantage of this approach is that the EPIC processor does not 
need as much complex circuitry as an out-of-order superscalar processor. Further, 
whereas the processor has only a matter of nanoseconds to determine potential par- 
allel execution opportunities, the compiler has orders of magnitude more time to ex- 
amine the code at leisure and see the program as a whole. 


21.2 GENERAL ORGANIZATION 


As with any processor architecture, [A-64 can be implemented in a variety of orga- 
nizations. Figure 21.1 suggests in general terms the organization of an IA-64 ma- 
chine. The key features are: 


e Large number of registers: The [A-64 instruction format assumes the use of 256 
registers: 128 64-bit registers for integer, logical, and general-purpose use, and 
128 82-bit registers for floating-point and graphic use. There are also 64 1-bit 
predicate registers used for predicated execution, as explained subsequently. 


e Multiple execution units: A typical commercial superscalar machine today 
may support four parallel pipelines, using four parallel execution units in both 
the integer and floating-point portions of the processor. It is expected that 
1A-64 will be implemented on systems with eight or more parallel units. 


21.2 / GENERAL ORGANIZATION 21-5 


“<7OozxmES 


GR = General-purpose or integer register 
FR = Floating-point or graphics register 
PR = One-bit predicate register 

EU = Execution unit 


Figure 21.1 General Organization for I[A-64 
Architecture 


The register file is quite large compared with most RISC and superscalar ma- 
chines. The reason for this is that a large number of registers is needed to support a 
high degree of parallelism. In a traditional superscalar machine, the machine lan- 
guage (and the assembly language) employs a small number of visible registers, and 
the processor maps these onto a larger number of registers using register renaming 
techniques and dependency analysis. Because we wish to make parallelism explicit 
and relieve the processor of the burden of register renaming and dependency analy- 
sis, we need a large number of explicit registers. 

The number of execution units is a function of the number of transistors avail- 
able in a particular implementation. The processor will exploit parallelism to the ex- 
tent that it can. For example, if the machine language instruction stream indicates that 
eight integer instructions may be executed in parallel, a processor with four integer 
pipelines will execute these in two chunks. A processor with eight pipelines will exe- 
cute all eight instructions simultaneously. 

Four types of execution unit are defined in the [A-64 architecture: 


e J-unit: For integer arithmetic, shift-and-add, logical, compare, and integer mul- 
timedia instructions 


e M-unit: Load and store between register and memory plus some integer ALU 
operations 


e B-unit: Branch instructions 


e F-unit: Floating-point instructions 


Each IA-64 instruction is categorized into one of six types. Table 21.2 lists 
the instruction types and the execution unit types on which they may be executed. 
The extended (X) instruction type includes instructions in which two slots in a 
bundle are used to encode the instruction, allowing for more information than fits 
into a 41-bit instruction (slots and bundles are explained in the next section). 
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Table 21.2 Relationship Between Instruction Type and 
Execution Unit Type 


Instruction Type Description Execution Unit Type 


A Integer ALU T-unit or M-unit 
Non-ALU integer T-unit 


Memory M-unit 


Branch B-unit 


I 
M 
F Floating-point F-unit 
B 
x 


Extended J-unit/B-unit 


21.3 PREDICATION, SPECULATION, AND SOFTWARE PIPELINING 


This section looks at the key features of the IA-64 architecture that support 
instruction-level parallelism. First, we need to provide an overview of the IA-64 
instruction format and, to support the examples in this section, define the general 
format of IA-64 assembly language instructions. 


Instruction Format 


1A-64 defines a 128-bit bundle that contains three instructions, called syllables, and 
a template field (Figure 21.2a). The processor can fetch instructions one or more 
bundles at a time; each bundle fetch brings in three instructions. The template field 


~ 128-bit bundle > 
2 A . Tem- 
Instruction slot 2 Instruction slot 1 Instruction slot 0 plate 

41 41 41 5 


(a) IA-64 bundle 


< 41-bit instruction > 
Major PR 
opcode 
4 31 6 


(b) General IA-64 instruction format 


Don Romenmoditeine nite GR3 GR2 GR1 PR 
opcode 
4 10 7 7 7 6 


(c) Typical IA-64 instruction format 


PR = Predicate register 
GR = General or floating-point register 


Figure 21.2 IA-64 Instruction Format 
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contains information that indicates which instructions can be executed in parallel. 
The interpretation of the template field is not confined to a single bundle. Rather, 
the processor can look at multiple bundles to determine which instructions may be 
executed in parallel. For example, the instruction stream may be such that eight 
instructions can be executed in parallel. The compiler will reorder instructions so 
that these eight instructions span contiguous bundles and set the template bits so 
that the processor knows that these eight instructions are independent. 

The bundled instructions do not have to be in the original program order. Fur- 
ther, because of the flexibility of the template field, the compiler can mix independent 
and dependent instructions in the same bundle. Unlike some previous VLIW designs, 
1A-64 does not need to insert null-operation (NOP) instructions to fill in the bundles. 

Table 21.3 shows the interpretation of the possible values for the 5-bit tem- 
plate field (some values are reserved and not in current use). The template value 
accomplishes two purposes: 


Table 21.3 Template Field Encoding and Instruction 
Set Mapping 


Template 


00 
01 
02 
03 
04 
05 
08 
09 
0A 
0B 
0C 
0D 
OE 
OF 
10 
11 
12 
13 
16 
17 
18 
19 
1C 
1D 
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1. The field specifies the mapping of instruction slots to execution unit types. Not 
all possible mappings of instructions to units are available. 


2. The field indicates the presence of any stops. A stop indicates to the hardware 
that one or more instructions before the stop may have certain kinds of re- 
source dependencies with one or more instructions after the stop. In the table, 
a heavy vertical line indicates a stop. 


Each instruction has a fixed-length 41-bit format (Figure 21.2b). This is some- 
what longer than the traditional 32-bit length found on RISC and RISC superscalar 
machines (although it is much shorter than the 118-bit micro-operation of the Pen- 
tium 4). Two factors lead to the additional bits. First, [A-64 makes use of more regis- 
ters than a typical RISC machine: 128 integer and 128 floating-point registers. 
Second, to accommodate the predicated execution technique, an IA-64 machine in- 
cludes 64 predicate registers. Their use is explained subsequently. 

Figure 21.2c shows in more detail the typical instruction format. All instruc- 
tions include a 4-bit major opcode and a reference to a predicate register. Although 
the major opcode field can only discriminate among 16 possibilities, the interpretation 
of the major opcode field depends on the template value and the location of the in- 
struction within a bundle (Table 21.3), thus affording more possible opcodes. Typical 
instructions also include three fields to reference registers, leaving 10 bits for other 
information needed to fully specify the instruction. 


Assembly-Language Format 


As with any machine instruction set, an assembly language is provided for the con- 
venience of the programmer. The assembler or compiler then translates each assem- 
bly language instruction into a 41-bit [A-64 instruction. The general format of an 
assembly language instruction is: 


[gp] mnemonic|.comp] dest = srcs 
where 


gp Specifies a 1-bit predicate register used to qualify the instruction. If 
the value of the register is 1 (true) at execution time, the instruction 
executes and the result is committed in hardware. If the value is false, 
the result of the instruction is not committed but is discarded. Most 
IA-64 instructions may be qualified by a predicate but need not be. 
To account for an instruction that is not predicated, the qp value is set 
to 0 and predicate register zero always has the constant value of 1. 


mnemonic — Specifies the name of an IA-64 instruction. 


comp Specifies one or more instruction completers, separated by periods, 
which are used to qualify the mnemonic. Not all instructions require 
the use of a completer. 


dest Specifies one or more destination operands, with the typical case 
being a single destination. 


STCS Specifies one or more source operands. Most instructions have two or 
more source operands. 
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On any line, any characters to the right of a double slash “//” are treated as a 
comment. Instruction groups and stops are indicated by a double semicolon “;;”. An 
instruction group is defined as a sequence of instructions that have no read after 
write or write after write dependencies. The processor can issue these without hard- 
ware checks for register dependencies. Here is a simple example: 


1d8 rl 
add r3 


[eb] Fy // First group 


1, ra // Second group 


The first instruction reads an 8-byte value from the memory location whose 
address is in register r5 and then places that value in register r1. The second instruc- 
tion adds the contents of rl and r4 and places the result in r3. Because the second in- 
struction depends on the value in rl, which is changed by the first instruction, the 
two instructions cannot be in the same group for parallel execution. 

Here is a more complex example, with multiple register flow dependencies: 


1d8 ri -= [rd] // First group 
sub r6 = r8, r9 ;; // First group 
add r3 = rl, r4 // Second group 
st8 [r6] = r12 // Second group 


The last instruction stores the contents of r12 in the memory location whose 
address is in r6. 

We are now ready to look at the four key mechanisms in the IA-64 architec- 
ture to support instruction-level parallelism: 


e Predication 

e Control speculation 
e Data speculation 

e Software pipelining 


Figure 21.3, based on a figure in [HALF97], illustrates the first two of these tech- 
niques, which are discussed in this subsection and the next. 


Predicated Execution 


Predication is a technique whereby the compiler determines which instructions may 
execute in parallel. In the process, the compiler eliminates branches from the pro- 
gram by using conditional execution. A typical example in a high-level language is 
an if-then-else instruction. A traditional compiler inserts a conditional branch at the 
if point of this construct. If the condition has one logical outcome, the branch is not 
taken and the next block of instructions is executed, representing the then path; at the 
end of this path is an unconditional branch around the next block, representing the 
else path. If the condition has the other logical outcome, the branch is taken around 
the then block of instructions and execution continues at the else block of instruc- 
tions. The two instruction streams join together after the end of the else block. An 
1A-64 compiler instead does the following (Figure 21.3a): 
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The compiler might rearrange instructions in this order, pairing 
instructions 4 and 7, 5 and 8, and 6 and 9 for parallel execution. 


Instruction 1 | Instruction 2 | Instruction 3 


Instruction 4 | Instruction 7 | Instruction 5 


Instruction 8 | Instruction 6 | Instruction 9 


Figure 21.3 
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(b) Speculative loading 
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1. At the if point in the program, insert a compare instruction that creates two 
predicates. If the compare is true, the first predicate is set to true and the sec- 
ond to false; if the compare is false, the first predicate is set to false and the sec- 
ond to true. 


nN 


. Augment each instruction in the then path with a reference to a predicate regis- 
ter that holds the value of the first predicate, and augment each instruction in the 
else path with a reference to a predicate register that holds the value of the sec- 
ond predicate. 


3. The processor executes instructions along both paths. When the outcome of 
the compare is known, the processor discards the results along one path and 
commits the results along the other path. This enables the processor to feed in- 
structions on both paths into the instruction pipeline without waiting for the 
compare operation to complete. 


As an example, consider the following source code: 


if (a&&b) 
DSS) Ly 
else 
Source Code: LE <¢e) 
k=k+1; 
else 
k = k - 1; 
1 =i +1; 


Two if statements jointly select one of three possible execution paths. This can 
be compiled into the following code, using the Pentium assembly language. The pro- 
gram has three conditional branches and one unconditional branch instructions: 


cmp a, 0O ; compare a with 0 


je L1 ; branch to L1 if a = 0 
cmp b, 0 
je L1 
add j, 1 7 Seeger 1 
Assembly Code: gee 3 
Li: cmp c, 0O 
je L2 
add k, 1 7 k=ak+i1 
jmp L3 
L2: sub k, 1 >; k=k-t1 
L3: add i, 1 7 ld =i+tl 


In the Pentium assembly language, a semicolon is used to delimit a comment. 
Figure 21.4 shows a flow diagram of this assembly code. This diagram breaks 
the assembly language program into separate blocks of code. For each block that 
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cmp a, 0 
je L1 
| T 
cmp b, 0 
je L1 ie 
F 
T 
p 3 L1: cmp ¢, 0 1 
add j,1 P jeL2 p 
a a4 
add k, 1 p4 12: sub k, 1 ps 


Figure 21.4 Example of Predication 


executes conditionally, the compiler can assign a predicate. These predicates are in- 
dicated in Figure 21.4. Assuming that all of these predicates have been initialized to 
false, the resulting I[A-64 assembly code is as follows: 


(lh) cmp.eq pl, p2 = 0, a ;; 
(2) (p2) cmp.eq pl, p3 = 0, b 
(3) (p3) add j =1, j 

Predicated Code (4) (pl) cmp.ne p4, p5 = 0, c 
(5) (p4) add k = 1, k 
(6) (p5) add k = -1, k 
(7) add i=l, i 


Instruction (1) compares the contents of symbolic register a with 0; it sets the 
value of predicate register p1 to 1 (true) and p2 to 0 (false) if the relation is true and 
will set the value of predicate p1 to 0 and p2 to 1 if the relation is false. Instruction 
(2) is to be executed only if the predicate p2 is true (i.e., if a is true, which is equiva- 
lent to a # 0). The processor will fetch, decode, and begin executing this instruction, 
but only make a decision as to whether to commit the result after it determines 
whether the value of predicate register pl is 1 or 0. Note that instruction (2) is a 
predicate-generating instruction and is itself predicated. This instruction requires 
three predicate register fields in its format. 
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Returning to our Pentium program, the first two conditional branches in the 
Pentium assembly code are translated into two IA-64 predicated compare in- 
structions. If instruction (1) sets p2 to false, the instruction (2) is not executed. 
After instruction (2) in the [A-64 program, p3 is true only if the outer if state- 
ment in the source code is true. That is, predicate p3 is true only if the expression 
(a AND b) is true (i.e.,a # O AND b # 0). The then part of the outer if statement 
is predicated on p3 for this reason. Instruction (4) of the IA-64 code decides 
whether the addition or subtraction instruction in the outer else part is per- 
formed. Finally, the increment of i is performed unconditionally. Looking at the 
source code and then at the predicated code, we see that only one of instructions 
(3), (5), and (6) is to be executed. In an ordinary superscalar processor, we would 
use branch prediction to guess which of the three is to be executed and go down 
that path. If the processor guesses wrong, the pipeline must be flushed. An IA-64 
processor can begin execution of all three of these instructions and, once the val- 
ues of the predicate registers are known, commit only the results of the valid in- 
struction. Thus, we make use of additional parallel execution units to avoid the 
delays due to pipeline flushing. 

Much of the original research on predicated execution was done at the Uni- 
versity of Illinois. Their simulation studies indicate that the use of predication results 
in a substantial reduction in dynamic branches and branch mispredictions and a sub- 
stantial performance improvement for processors with multiple parallel pipelines 
(e.g., [MAHL94], [MAHL95]). 


Control Speculation 


Another Key innovation in IA-64 is control speculation, also known as speculative 
loading. This enables the processor to load data from memory before the program 
needs it, to avoid memory latency delays. Also, the processor postpones the report- 
ing of exceptions until it becomes necessary to report the exception. The term hoist 
is used to refer to the movement of a load instruction to a point earlier in the 
instruction stream. 

The minimization of load latencies is crucial to improving performance. Typi- 
cally, early in a block of code, there are a number of load operations that bring data 
from memory to registers. Because memory, even augmented with one or two lev- 
els of cache, is slow compared with the processor, the delays in obtaining data from 
memory become a bottleneck. To minimize this, we would like to rearrange the 
code so that loads are done as early as possible. This can be done with any com- 
piler, up to a point. The problem occurs if we attempt to move a load across a con- 
trol flow. You cannot unconditionally move the load above a branch because the 
load may not actually occur. We could move the load conditionally, using predi- 
cates, so that the data could be retrieved from memory but not committed to an ar- 
chitectural register until the outcome of the predicate is known; or we can use 
branch prediction techniques of the type we saw in Chapter 14. The problem with 
this strategy is that the load can blow up. An exception due to invalid address or a 
page fault could be generated. If this happens, the processor would have to deal 
with the exception or fault, causing a delay. 
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How then, can we move the load above the branch? The solution specified in 
1A-64 is the control speculation, which separates the load behavior (delivering the 
value) from the exception behavior (Figure 21.3b). A load instruction in the original 
program is replaced by two instructions: 


e A speculative load (ld.s) executes the memory fetch, performs exception de- 
tection, but does not deliver the exception (call the OS routine that handles 
the exception). This ld.s instruction is hoisted to an appropriate point earlier in 
the program. 


e A checking instruction (chk.s) remains in the place of the original load and de- 
livers exceptions. This chk.s instruction may be predicated so that it will only 
execute if the predicate is true. 


If the Id.s detects an exception, it sets a token bit associated with the target 
register, known as the Not a Thing (NaT) bit. If the corresponding chk.s instruction 
is executed, and if the NaT bit is set, the chk.s instruction branches to an exception- 
handling routine. 

Let us look at a simple example, taken from [INTE00a, Volume 1]. Here is the 
original program: 


(p1) br some_label // Cycle 0 
1d8: #i- = [eS] ae // Cycle 1 
add: #2 = 41. 73 // Cycle 3 


The first instruction branches if predicate p1 is true (register p1 has value 1). Note 
that the branch and load instructions are in the same instruction group, even though the 
load should not execute if the branch is taken. IA-64 guarantees that if a branch is 
taken, later instructions, even in the same instruction group, are not executed. IA-64 
implementations may use branch prediction to try to improve efficiency but must as- 
sure against incorrect results. Finally, note that the add instruction is delayed by at least 
a clock period (one cycle) due to the memory latency of the load operation. 

The compiler can rewrite this code using a control speculative load and a check: 


1d8.s rl = [r5] ;; // Cycle -2 
// Other instructions 

(p1) br some_label // Cycle 0 
chk.s rl, recovery // Cycle 0 
add r2 = rl, 4x3 // Cycle 0 


We can’t simply move the load instruction above the branch instruction, as is, 
because the load instruction may cause an exception (e.g., r5 may contain a null 
pointer). Instead, we convert the load to a speculative load, ld8.s, and then move it. 
The speculative load doesn’t immediately signal an exception when detected; it just 
records that fact by setting the NaT bit for the target register (in this case, rl). The 
speculative load now executes unconditionally at least two cycles prior to the 
branch. The chk.s instruction then checks to see if the NaT bit is set on r1. If not, ex- 
ecution simply falls through to the next instruction. If so, a branch is taken to a 
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recovery program. Note that the branch, check, and add instructions are all shown 
as being executed in the same clock cycle. However, the hardware ensures that the 
results produced by the speculative load do not update the application state (change 
the contents of rl and r2) unless two conditions occur: the branch is not taken 
(p1 = 0) and the check does not detect a deferred exception (r1.NaT = 0). 

There is one other important point to note about this example. If there is no 
exception, then the speculative load is an actual load and takes place prior to the 
branch that it is supposed to follow. If the branch is taken, then a load has occurred 
that was not intended by the original program. The program, as written, assumes that 
r1 is not read on the taken-branch path. If rl is read on the taken-branch path, then 
the compiler must use another register to hold the speculative result. 

Let us look at a more complex example, used by Intel and HP to benchmark 
predicated programs and to illustrate the use of speculative loads, known as the 
Eight Queens Problem. The objective is to arrange eight queens on a chessboard so 
that no queen threatens any other queen. Figure 21.5a shows one solution. The key 
line of source code, in an inner loop, is the following: 


where 1 =i,j = 8. 


a array 
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(b) b and c arrays 
Figure 21.5 The Eight Queens Problem 
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The queen conflict tracking mechanism consists of three Boolean arrays that 
track queen status for each row and diagonal. TRUE means no queen is on that row 
or diagonal; FALSE means a queen is already there. Figures 21.5b and c show the 
mapping of the arrays to the chessboard. All array elements are initialized to 
TRUE. The B array elements 1 through 8 correspond to rows 1 through 8 on the 
board. A queen in row n sets b[n] to FALSE. C array elements are numbered from 
—7 to 7 and correspond to the difference between column and row numbers, which 
defines the diagonals that go down to the right. A queen at column 1, row 1 sets c[0] 
to FALSE. A queen at column 1, row 8 sets c[—7] to FALSE. The A array elements 
are numbered 2-16 and correspond to the sum of the column and row. A queen 
placed in column 1, row 1 sets a[2] to FALSE. A queen placed in column 3, row 5 
sets a[8] to FALSE. 

The overall program moves through the columns, placing a queen on each col- 
umn such that the new queen is not attacked by a queen previously placed on either 
along a row or one of the two diagonals. 

A straightforward Pentium assembly program includes three loads and three 
branches: 


(1 mov r2, &b[j] ; transfer contents of 
location 
; b[j] to register r2 
(2 cinp:-"F2>5 
(3) jne L2 
(4 mov r4, &af[i + j] 
Assembly (5) cmp r4, 
Code: (6 jne L2 
(7) mov r6, &c[i - j] 
(8 cmp r6, 1 
(9) jne L2 
(10) Ll: <code for then path> 
(11) L2: <code for else path> 


In the preceding program, the notation &x symbolizes an immediate address 
for location x. 
Using speculative loads and predicated execution yields the following: 


(1) mov rl = &b[j] // transfer 
address of 
// b[j] to rl 


(2) mov r3 = é&a[i + j] 

(3) mov r5 = &c[i - j + 7] 

(4) 1d8 r2 = [r1] // load indirect 
via rl 

(5) 1d8.s r4 = [r3] 

(6) 1d8.s r6 = [r5] 
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Code with (7) cmp.eq pl, p2 = 1, r2 
Speculation and (8) (p2) br L2 
Predication: (9) chk.s r4, recovery_a // fixup for 
loading a 
(10) cmp.eq p3, p4 = 1, r4 
(11) (p4) br L2 
(12) chk.s r6, recovery_b // fixup for 
loading b 
(13) cmp.eq p5, p6é = 1, r5 
(14) (p6) br L2 
(15) Lil: <code for then path> 
(16) L2: <code for else path> 


The assembly program breaks down into three basic blocks of code, each of 
which is a load followed by a conditional branch. The address-setting instructions 4 
and 7 in the Pentium assembly code are simple arithmetic calculations; these can be 
done anytime, so the compiler moves these up to the top. Then the compiler is faced 
with three simple blocks, each of which consists of a load, a condition calculation, 
and a conditional branch. There seems little hope of doing anything in parallel here. 
Furthermore, if we assume that the load takes two or more clock cycles, we have 
some wasted time before the conditional branch can be executed. What the com- 
piler can do is hoist the second and third loads (instructions 5 and 8 in the Pentium 
code) above all the branches. This is done by putting a speculative load up top 
(1A-64 instructions 5 and 6) and leaving a check in the original code block (IA-64 
instructions 9 and 12). 

This transformation makes it possible to execute all three loads in parallel and 
to begin the loads early so as to minimize or avoid delays due to load latencies. The 
compiler can go further by more aggressive use of predication, and eliminate two of 
the three branches: 


(1) mov rl = &b[j] 

(2) mov r3 = &a[i + j] 

(3) mov r5 = &c[i - j + 7] 

(4) 1d8 r2 = [r1] 

(5) 1d8.s r4 = [r3] 
Revised Code with (6) 1d8.s r6 = [r5] 
Speculation and (7) cmp.eq pl, p2 = 1, r2 
Predication: (8) (pl) chk.s r4, recovery_a 

(9) (pl) cmp.eq p3, p4 = 1, r4 

(10) (p3) chk.s r6, recovery_b 

(11) (p3) cmp.eq p5, p4 = 1, r5 

(12) (p6) br L2 

(13) Ll: <code for then path> 

(14) L2 <code for else path> 
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We already had a compare that generated two predicates. In the revised code, in- 
stead of branching on the false predicate, the compiler qualifies execution of both 
the check and the next compare on the true predicate. The elimination of two 
branches means the elimination of two potential mispredictions, so that the savings 
is more than just two instructions. 


Data Speculation 


In a control speculation, a load is moved earlier in a code sequence to compensate 
for load latency, and a check is made to assure that an exception doesn’t occur if it 
subsequently turns out that the load was not taken. In data speculation, a load is 
moved before a store instruction that might alter the memory location that is the 
source of the load. A subsequent check is made to assure that the load receives the 
proper memory value. To explain the mechanism, we use an example taken from 
[INTE00a, Volume 1]. 
Consider the following program fragment: 


st8 [r4] = r12 // Cycle 0 
1d8 r6 = [r8] ;; // Cycle 0 
add r5 = r6, r7 ;; // Cycle 2 
st8 [r18] = r5 // Cycle 3 


As written, the code requires four instruction cycles to execute. If registers r4 
and r8 do not contain the same memory address, then the store through r4 cannot 
affect the value at the address contained in r8; under this circumstance, it is safe to 
reorder the load and store to more quickly bring the value into r6, which is needed 
subsequently. However, because the addresses in r4 and r8 may be the same or over- 
lap, such a swap is not safe. [A-64 overcomes this problem with the use of a tech- 
nique known as advanced load. 


1d8.a r6 = [r8] ;; // Cycl 2 or earlier; 
advanced load 


// other instructions 


st8 [r4] = r12 // Cycle 0 
1d8.c r6 = [r8] // Cycle 0; check load 
addrrS) = 46-, °F // Cycle 0 
st8 [r18] = r5 {/-Cyele. 1 


Here we have moved the Id instruction earlier and converted it into an ad- 
vanced load. In addition to performing the specified load, the I1d8.a instruction 
writes its source address (address contained in r8) to a hardware data structure 
known as the Advanced Load Address Table (ALAT). Each IA-64 store instruc- 
tion checks the ALAT for entries that overlap with its target address; if a match is 
found, the ALAT entry is removed. When the original Id8 is converted to an Id8.a 
instruction and moved, the original position of that instruction is replaced with a 
check load instruction, ld8.c. When the check load is executed, it checks the 
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ALAT for a matching address. If one is found, no store instruction between the 
advanced load and the check load has altered the source address of the load, and 
no action is taken. However, if the check load instruction does not find a match- 
ing ALAT entry, then the load operation is performed again to assure the correct 
result. 

We may also want to speculatively execute instructions that are data depen- 
dent on a load instruction, together with the load itself. Starting with the same 
original program, suppose we move up both the load and the subsequent add 
instruction: 


1d8.a r6 = [r8] ;; // Cycle -3 or earlier; 
advanced load 

// other instructions 

add r5 = r6, r7 // Cycle -1; add that uses r6 

// other instructions 

st8 [r4] = r12 // Cycle 0 

chk.a r6, recover // Cycle 0; check 

back: // return point from jump to 

recover 

st8 [r18] = r5 // Cycle 0 


Here we use a chk.a instruction rather than an ld8.c instruction to validate 
the advanced load. If the chk.a instruction determines that the load has failed, it 
cannot simply reexecute the load; instead, it branches to a recovery routine to 
clean up: 


Recover: 
1d8 r6 = [r8] ;; // veload r6 from [r8] 
add--75sS'r6). 67 36 // xecute the add 
br back // jump back to main code 


This technique is effective only if the loads and stores involved have little 
chance of overlapping. 


Software Pipelining 


Consider the following loop: 


L1: 1d4 r4 = [r5], 4 ;; // Cycle 0; load postinc 4 
add r7 = r4, x9 ;; // Cycle 2 
st4 [r6] = r7, 4 // Cycle 3; store postinc 4 
br.cloop Ll ;; // Cycle 3 


This loop adds a constant to one vector and stores the result in another vector 
(e.g. y[i] = x[i] + c). The ld4 instruction loads 4 bytes from memory. The qualifier 
“4” at the end of the instruction signals that this is the base update form of the load 
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instruction; the address in r5 is incremented by 4 after the load takes place. Similarly, 
the st4 instruction stores four bytes in memory and the address in r6 is incremented 
by four after the store. The br.cloop instruction, known as a counted loop branch, 
uses the Loop Count (LC) application register. If the LC register is greater than 
zero, it is decremented and the branch is taken. The initial value in LC is the number 
of iterations of the loop. 

Notice that in this program, there is virtually no opportunity for instruction- 
level parallelism within a loop. Further, the instructions in iteration x are all exe- 
cuted before iteration x + 1 begins. However, if there is no address conflict 
between the load and store (15 and r6 point to nonoverlapping memory loca- 
tions), then utilization could be improved by moving independent instructions 
from iteration x + 1 to iteration x. Another way of saying this is that if we unroll 
the loop code by actually writing out a new set of instructions for each iteration, 
then there is opportunity to increase parallelism. Let’s see what could be done 
with five iterations: 


1d4 r32 = [r5], 4 ;; // Cycle 0 
1d4 r33 = [r5], 4 ;; // Cycle 1 
1d4 34 = [r5], :4 // Cycle 2 
add r36 = r32, r9 ;; // Cycle 2 
Ld4- #35 = prs], 4 // Cycle 3 
add r37 = r33, r9 // Cycle 3 
st4 [r6] = r36, 4 ;; // Cycle 3 
1d4 r36 = [r5], 4 // Cycle 3 
add r38 = r34, r9 // Cycle 4 
st4 [r6] = r37, 4 ;; // Cycle 4 
add r39 = r35, r9 // Cycle 5 
st4 [r6] = r38, 4 ;; // Cycle 5 
add r40 = r36, r9 // Cycle 6 
st4 [r6] = r39, 4 ;; // Cycle 6 
st4 [r6] = r40, 4 ;; // Cycle 7 


This program completes 5 iterations in 7 cycles, compared with 20 cycles in the 
original looped program. This assumes that there are two memory ports so that a 
load and a store can be executed in parallel. This is an example of software pipelin- 
ing, analogous to hardware pipelining. Figure 21.6 illustrates the process. Parallelism 
is achieved by grouping together instructions from different iterations. For this to 
work, the temporary registers used inside the loop must be changed for each itera- 
tion to avoid register conflicts. In this case, two temporary registers are used (r4 and 
r7 in the original program). In the expanded program, the register number of each 
register is incremented for each iteration, and the register numbers are initialized 
sufficiently far apart to avoid overlap. 

Figure 21.6 shows that the software pipeline has three phases. During the prolog 
phase, a new iteration is initiated with each clock cycle and the pipeline gradually fills 
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Figure 21.6 Software Pipelining Example 


up. During the kernel phase, the pipeline is full, achieving maximum parallelism. For 
our example, three instructions are performed in parallel during the kernel phase, but 
the width of the pipeline is four. During the epilog phase, one iteration completes with 
each clock cycle. 

Software pipelining by loop unrolling places a burden on the compiler or 
programmer to assign register names properly. Further, for long loops with many 
iterations, the unrolling results in a significant expansion in code size. For an inde- 
terminate loop (total iterations unknown at compile time), the task is further com- 
plicated by the need to do a partial unroll and then to control the loop count. IA-64 
provides hardware support to perform software pipelining with no code expansion 
and with minimal burden on the compiler. The key features that support software 
pipelining are: 


e Automatic register renaming: A fixed-sized area of the predicate and floating- 
point register files (p16 to p63; fr32 to fr127) and a programmable-sized area 
of the general register file (maximum range of r32 to r127) are capable of 
rotation. This means that during each iteration of a software-pipeline loop, 
register references within these ranges are automatically incremented. Thus, if 
a loop makes use of general register r32 on the first iteration, it automatically 
makes use of r33 on the second iteration, and so on. 


e Predication: Each instruction in the loop is predicated on a rotating predicate 
register. The purpose of this is to determine whether the pipeline is in prolog, 
kernel, or epilog phase, as explained subsequently. 


e Special loop terminating instructions: These are branch instructions that cause 
the registers to rotate and the loop count to decrement. 


This is a relatively complex topic; here, we present an example that illustrates 
some of the IA-64 software pipelining capabilities. We take the original loop program 
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from this section and show how to program it for software pipelining, assuming a 
loop count of 200 and that there are two memory ports: 


mov lc = 199 // set loop count 
register to 199, 
// which equals loop 
count - 1 
mov ec = 4 // set epilog count 
register equal 
// to number of epilog 
stages + 1 


mov pr.rot = 1<<16;; // pri6é = 1; rest = 0 


L1: (p16) Waa) 3 9)\= [ae sjis A // Cycle 0 
(p17) —-- // Empty stage 
(p18) add r35 = r34, r9 // Cycle 0 
(p19) st4 [r6] = r36, 4 // Cycle 0 

br.ctop Ll ;; // Cycle 0 


We summarize the key points related to this program: 


. The loop body is partitioned into multiple stages, with zero or more instruc- 


tions per stage. 


. Execution of the loop proceeds through three phases. During the prolog phase, a 


new loop iteration is started each time around, adding one stage to the pipeline. 
During the kernel phase, one loop iteration is started and one completed each 
time around; the pipeline is full, with the maximum number of stages active. Dur- 
ing the epilog phase, no new iterations are started and one iteration is completed 
each time around, draining the software pipeline. 


. A predicate is assigned to each stage to control the activation of the instructions 


in that stage. During the prolog phase, p16 is true and p17, p18, and p19 are false 
for the first iteration. For the second iteration, p16 and p17 are true; during the 
third iteration p16, p17, and p18 are true. During the kernel phase, all predicates 
are true. During the epilog phase, the predicates are turned to false one by one, 
beginning with p16. The changes in predicate values are achieved by predicate 
register rotation. 


. All general registers with register numbers greater than 31 are rotated with each 


iteration. Registers are rotated toward larger register numbers in a wraparound 
fashion. For example, the value in register x will be located in register x + 1 after 
one rotation; this is achieved not by moving values but by hardware renaming of 
registers. Thus, in our example, the value that the load writes in r32 is read by the 
add two iterations (and two rotations) later as r34. Similarly the value that the 
add writes in r35 is read by the store one iteration later as r36. 


. For the br.ctop instruction, the branch is taken if either LC > 0 or EC > 1. 


Execution of br.ctop has the following additional effects: If LC > 0, then LC 
is decremented; this happens during the prolog and kernel phases. If LC = 0 
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Table 21.4 Loop Trace for Software Pipelining Example 


Execution Unit/Instruction State before br.ctop 


M B p1s p19 
br.ctop 


br.ctop 


br.ctop 


br.ctop 


br.ctop 


br.ctop 


br.ctop 


br.ctop 


br.ctop 


and EC > 1, EC is decremented; this happens during the epilog phase. The 
instruction also control register rotation. If LC > 0, each execution of br.ctop 
places a 1 in p63. With rotation, p63 becomes p16, feeding a continuous se- 
quence of ones into the predicate registers during the prolog and kernel 
phases. If LC = 0, then br.ctop sets p63 to 0, feeding zeros into the predicate 
registers during the epilog phase. 


Table 21.4 shows a trace of the execution of this example. 


21.4 IA-64 INSTRUCTION SET ARCHITECTURE 


Figure 21.7 shows the set of registers available to application programs. That is, these 
registers are visible to applications and may be read and, in most cases, written. The 
register sets include: 


e General registers: 128 general-purpose 64-bit registers. Associated with each 
register is a NaT bit used to track deferred speculative exceptions, as ex- 
plained in Section 21.3. Registers r0 through r31 are referred to as static; a 
program reference to any of these references is literally interpreted. Registers 
r32 through 1127 can be used as rotating registers for software pipelining 
(discussed in Section 21.3) and for register stack implementation (discussed 
subsequently in this section). References to these registers are virtual, and the 
hardware my perform register renaming dynamically. 


e Floating-point registers: 128 82-bit registers for floating-point numbers. This 
size is sufficient to hold IEEE 754 double extended format numbers (see 
Table 9.3). Registers frO through fr31 are static, and registers fr32 through 
fr127 can be used as rotating registers for software pipelining. 
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Figure 21.7 TA-64 Application Register Set 
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e Predicate registers: 64 1-bit registers used as predicates. Register pr0 is always 
set to 1 to enable unpredicated instructions. Registers prO through pr15 are 
static, and registers pr16 through pr63 can be used as rotating registers for 
software pipelining. 

e Branch registers: 8 64-bit registers used for branches. 


e Instruction pointer: Holds the bundle address of the currently executing [A-64 
instruction. 


e Current frame marker: Holds state information relating to the current general 
register stack frame and rotation information for fr and pr registers. 


e User mask: A set of single-bit values used for alignment traps, performance 
monitors, and to monitor floating-point register usage. 


e Performance monitor data registers: Used to support performance monitor 
hardware. 


e Processor identifiers: Describe processor implementation-dependent features. 


e Application registers: A collection of special-purpose registers. Table 21.5 pro- 
vides a brief definition of each. 


Table 21.5 


Kernel registers (KRO-7) 


TA-64 Application Registers 


Convey information from the operating system to the 
application. 


Register stack configuration (RSC) 


Controls the operation of the register stack engine (RSE). 


RSE Backing store pointer (BSP) 


Holds the address in memory that is the save location 
for r32 in the current stack frame. 


RSE Backing store pointer to 
memory stores (BSPSTORE) 


Holds the address in memory to which the RSE will spill 
the next value. 


RSE NaT collection register 
(RNAT) 


Used by the RSE to temporarily hold NaT bits when it 
is spilling general registers. 


Compare and exchange value (CCV) 


Contains the compare value used as the third source 
operand in the cmpxchg instruction. 


User NaT collection register 
(UNAT) 


Used to temporarily hold NaT bits when saving and 
restoring general registers with the ld8.fill and st8.spill 
instructions. 


Floating-point status register (FPSR) 


Controls traps, rounding mode, precision control, flags, and 
other control bits for floating-point instructions. 


Interval time counter (ITC) 


Counts up at a fixed relationship to the processor clock 
frequency. 


Previous function state (PFS) 
Loop count (LC) 


Saves value in CFM register and related information. 


Used in counted loops and is decremented by counted- 
loop-type branches. 


Epilog count (EC) 


Used for counting the final (epilog) state in modulo- 
scheduled loops. 
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Figure 21.8 Register Stack Behavior on Procedure Call and Return 


Register Stack 


The register stack mechanism in IA-64 avoids unnecessary movement of data into and 
out of registers at procedure call and return. The mechanism automatically provides a 
called procedure with a new frame of up to 96 registers (132 through r127) upon proce- 
dure entry. The compiler specifies the number of registers required by a procedure with 
the alloc instruction, which specifies how many of these are local (used only within the 
procedure) and how many are output (used to pass parameters to a procedure called 
by this procedure). When a procedure call occurs, the [A-64 hardware renames regis- 
ters so that the local registers from the previous frame are hidden and what were the 
output registers of the calling procedure now have register numbers starting at r32 in 
the called procedure. Physical registers in the range r32 through r127 are allocated in a 
circular-buffer fashion to virtual registers associated with procedures. That is, the next 
register allocated after 1127 is r32. When necessary, the hardware moves register con- 
tents between registers and memory to free up additional registers when procedure 
calls occur, and restores contents from memory to registers as procedure returns occur. 

Figure 21.8 illustrates register stack behavior. The alloc instruction includes sof 
(size of frame) and sol (size of locals) operands to specify the required number of 
registers. These values are stored in the CFM register. When a call occurs, the sol and 
sof values from the CFM are stored in the sol and sof fields of the previous function 
state (PFS) application register (Figure 21.9). Upon return these sol and sof values 


L7-17 


RSC 


BSP, 
BSPSTORE 


RNAT 


PFS 


EC 


CFM 


34 14 11 L2, 2 
b 
loadrs : pl 
mode 
61 3 
Pointer 
1 63 
RSE NaT collection 
2 4 6 14 
ppl pec 
58 6 
epilog 
count 
6 7 7 4 7 7 
rrb.pr rrb.fr rrb.gr sor sol sof 


Figure 21.9 Formats of Some IA-64 Registers 
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must be restored from the PFS to the CFM. To allow nested calls and returns, previ- 
ous values of the PFS fields must be saved through successive calls so that they can 
be restored through successive returns. This is a function of the alloc instruction, 
which designates a general register to save the current value of the PFS fields before 
they are overwritten from the CFM fields. 


Current Frame Marker and Previous Function State 


The CFM register describes the state of the current general register stack frame, as- 
sociated with the currently active procedure. It includes the following fields: 


° sof: size of stack frame 
e sol: size of locals portion of stack frame 


° sor: size of rotating portion of stack frame; this is a subset of the local portion 
that is dedicated to software pipelining 


e register rename base values: Values used in performing register rotation gen- 
eral, floating-point and predicate registers 


The PFS application register contains the following fields: 


e pfm: Previous frame marker; contains all of the fields of the CFM 
e pec: Previous epilog count 
e ppl: Previous privilege level 


21.5 ITANIUM ORGANIZATION 


Intel’s Itanium processor is the first implementation of the [A-64 instruction set ar- 
chitecture. The first version of this implementation, known as Itanium, was released 
in 2001, followed in 2002 by the Itanium 2. The Itanium organization blends super- 
scalar features with support for the unique EPIC-related [A-64 features. Among the 
superscalar features are a six-wide, ten-stage-deep hardware pipeline, dynamic 
prefetch, branch prediction, and a register scoreboard to optimize for compile time 
nondeterminism. EPIC related hardware includes support for predicated execution, 
control and data speculation, and software pipelining. 

Figure 21.10 is a general block diagram of the Itanium organization. The Ita- 
nium includes nine execution units: two integer, two floating-point, four memory, 
and three branch execution units. Instructions are fetched through an L1 instruction 
cache and fed into a buffer that holds up to eight bundles of instructions. When de- 
ciding on functional units for instruction dispersal, the processor views at most two 
instruction bundles at a time. The processor can issue a maximum of six instructions 
per clock cycle. 

The organization is in some ways simpler than a conventional contemporary 
superscalar organization. The Itanium does not use reservation stations, reorder 
buffers, and memory ordering buffers, all replaced by simpler hardware for specula- 
tion. The register remapping hardware is simpler than the register aliasing typical of 
superscalar machines. Register dependency-detection logic is absent, replaced by 
explicit parallelism directives precomputed by the software. 
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Figure 21.10 Itanium 2 Processor Organization 


Using branch prediction, the fetch/prefetch engine can speculatively load 
an L1 instruction cache to minimize cache misses on instruction fetches. The 
fetched code is fed into a decoupling buffer that can hold up to eight bundles 
of code. 

Three levels of cache are used. The L1 cache is split into a 16-kbyte instruction 
cache and a 16-kbyte data cache, each 4-way set associative with a 32-byte line size. 
The 256-kbyte L2 cache is 6-way set associative with a 64-byte line size. The 3-Mbyte 
L3 cache is 4-way set associative with a 64-byte line size. All three levels of cache are 
on the same chip as the processor for the Itanium 2. For the original Itanium, the L3 
cache is off-chip but on the same package as the processor. 

The Itanium 2 uses an 8-stage pipeline for all but floating-point instructions. 
Figure 21.11 illustrates the relationship between the pipeline stages and the Itanium 
2 organization. The pipeline stages are: 


e Instruction pointer generation (IPG): Delivers an instruction pointer to the 
L1I cache. 


e Instruction rotation (ROT): Fetch instructions and rotate instructions into po- 
sition so that bundle 0 contains the first instruction that should be executed. 
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Figure 21.11 Itanium 2 Processor Pipeline [MCNAO3] 
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e Instruction template decode, expand and disperse (EXP): Decode instruction 
templates, and disperse up to 6 instructions through 11 ports in conjunction 
with opcode information for the execution units. 


e Rename and decode (REN): Rename (remap) registers for the register stack 
engine; decode instructions. 


e Register file read (REG): Delivers operands to execution units. 
e ALU execution (EXE): Execute operations. 


e Last stage for exception detection (DET): Detect exceptions; abandon result 
of execution if instruction predicate was not true; resteer mispredicted 
branches. 


e Write back (WRB): Write results back to register file. 


For floating-point instructions, the first five pipeline stages are the same as just 
listed, followed by four floating-point pipeline stages, followed by a write-back stage. 


21.6 RECOMMENDED READING AND WEB SITES 


[HUCKO00] provides an overview of IA-64; another overview is [DULO98]. [SCHL00a] pro- 
vides a general discussion of EPIC; a more thorough treatment is provided in [SCHLOOb]. 
Two other good treatments are [HWU01] and [KATH01]. [CHAS00] and [HWU98] provide 
introductions to predicated execution. Volume 1 of [INTE00a] contains a detailed treatment 
of software pipelining; two articles that provide a good explanation of the topic, with exam- 
ples, are [JARPO1] and [BHAROO]. 

For an overview of the Itanium processor architecture, see [SHA R00]; [INTE00b] provides 
amore detailed treatment. [MCNA03] and [NAFF02] describe the Itanium 2 in some detail. 

[EVANO03], [TRIE01], and [|MARKO0] contain more detailed treatments of the topics 
of this chapter. Finally, for an exhaustive look at the I[A-64 architecture and instruction set, 
see [INTE00a]. 


BHAR00 Bharandwaj, J., et al. “The Intel IA-64 Compiler Code Generator.” [EEE 
Micro, September/October 2000. 

CHAS00_ Chasin, A. “Predication, Speculation, and Modern CPUs.” Dr. Dobb’s Journal, 
May 2000. 

DULO98_Dulong, C. “The IA-64 Architecture at Work.” Computer, July 1998. 

EVAN03_ Evans, J., and Trimper, G. Itanium Architecture for Programmers. Upper Sad- 
dle River, NJ: Prentice Hall, 2003. 

HUCKO00 Huck, J., et al. “Introducing the IA-64 Architecture.” JEEE Micro, September/ 
October 2000. 

HWU98_ Hwu, W. “Introduction to Predicated Execution.” Computer, January 1998. 

HWUO01 Hwu, W:; August, D.; and Sias, J. “Program Decision Logic Optimization Using 
Predication and Control Speculation.” Proceedings of the IEEE, November 2001. 

INTE00a_ Intel Corp. Intel IA-64 Architecture Software Developer's Manual (4 volumes). 
Document 245317 through 245320. Aurora, CO, 2000. 


INTE00b Intel Corp. Itanium Processor Microarchitecture Reference for Software Opti- 
mization. Aurora, CO, Document 245473. August 2000. 
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MARKO00 Markstein, P. [A-64 and Elementary Functions. Upper Saddle River, NJ: Pren- 
tice Hall PTR, 2000. 
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IEEE Journal of Solid-State Circuits, November 2002. 
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Recommended Web sites: 


e Itanium: Intel’s site for the latest information on IA-64 and Itanium. 
e HP Itanium Technology site: Good source of information. 


e¢ IMPACT: This is a site at the University of Illinois, where much of the research on 
predicated execution has been done. A number of papers on the subject are available. 


21.7 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS 
Key Terms 


advanced load TA-64 architecture software pipeline 
branch predication instruction completer speculative loading 
bundle instruction group stack frame 
control speculation Itanium stop 


data speculation major opcode syllable 

execution unit NaT bit template field 

explicitly parallel instruction predicate register very long instruction word 
computing (EPIC) predication (VLIW) 

hoist register stack 
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Review Questions 


21.1 What are the different types of execution units for [A-64? 

21.2 Explain the use of the template field in an IA-64 bundle. 

21.3 What is the significance of a stop in the instruction stream? 

21.4 Define predication and predicated execution. 

21.5 Howcan predicates replace a conditional branch instruction? 

21.6 Define control speculation. 

21.7 What is the purpose of the NaT bit? 

21.8 Define data speculation. 

21.9 What is the difference between a hardware pipeline and a software pipeline? 

21.10 What is the difference between stacked and rotating registers? 
Problems 

21.1 Suppose that an IA-64 opcode accepts three registers as operands and produces one 
register as a result. What is the maximum number of different operations that can be 
defined in one major opcode family? 

21.2 What is the maximum effective number of major opcodes? 

21.3 Atacertain point in an IA-64 program, there are 10 A-type instructions and six floating- 
point instructions that can be issued concurrently. How many syllables may appear 
without any stops between them? 

21.4 In Problem 21.3, 

a. How many cycles are required for a small IA-64 implementation having one 
floating-point unit, two integer units, and two memory units? 
b. How many cycles are required for the Itanium organization of Figure 21.10? 

21.5 The initial Itanium implementation had two M-units and two I-units. Which of the 
templates in Table 21.3 cannot be paired as two bundles of instructions that could be 
executed completely in parallel? 

21.6 An algorithm that can utilize four floating-point instructions per cycle is coded for 
TA-64. Should instruction groups contain four floating-point operations? What are 
the consequences if the machine on which the program runs has fewer than four 
floating-point units? 

21.7 In Section 21.3, we introduced the following constructs for predicated execution: 


cmp.crel p2, p3 =a, b 
(p1) cmp.crel p2, p3 =a, b 


where crel is a relation, such as eq, ne, etc.; pl, p2, and p3 are predicate registers; a is 
either a register or an immediate operand; and b is a register operand. 


Fill in the following truth table: 


pl comparison p2 p3 


not present 0 


not present 


RlolRlolR 
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21.8 


21.9 


21.10 


21.11 


21.12 


For the predicated program in Section 21.3, which implements the flowchart of Figure 
21.4, indicate 

a. Those instructions that can be executed in parallel 

b. Those instructions that can be bundled into the same JA-64 instruction bundle 


The I[A-64 architecture includes a set of multimedia instructions comparable to those 
in the [A-32 Pentium architecture (Table 10.11). One such instruction type is the par- 
allel compare instruction of the form pemp1, pemp2, or pcmp4, which does a parallel 
compare 1, 2, or 4 bytes at a time. The instruction pcmp1.gt ri = rj, rk compares the 
two source operands (rj, rk) byte by byte. For each byte, if the byte in rj is greater than 
the byte in rk, then the corresponding byte in ri is set to all ones; otherwise the desti- 
nation byte is set to all zeros. Both operands are interpreted as signed. 

Suppose the registers r14 and 115 contain the ASCII strings (see Table F.1) 
“00000000” and “99999999” respectively and the register rl6 contains an arbitrary 
string of eight characters. Determine whether the comments in the following code 
fragment are appropriate. 


pemp1.gt r8 = r14,r16 // 1£ some char < “0” or 
pempl.gt r9 = ~16,r15 ;; // if some char > “9” 
cmp .ne p6,p0 = r8,r0 ;; // p6 = true or 
cmp .ne p7,p0 = r9,r0 ;; // p7 = true so that 
(p6) br error // this branch executes or 
(p7) br error ;; // this branch executes 


Consider the following source code segment: 


for ( i= 0; i < 100; i++ ) 
if (A[i] < 50 ) 
J Soy 4 a; 
else 
k = k++ 1; 


a. Write a corresponding Pentium assembly code segment. 
b. Rewrite as an IA-64 assembly code segment using predicated execution tech- 
niques. 


Consider the following C program fragment dealing with floating-point values: 


ali] =p *q 
ce = aljl; 
The compiler cannot establish that i # j, but has reason to believe that it probably is. 
a. Write an IA-64 program using an advanced load to implement this C program. 
Hint: the floating-point load and multiply instructions are ldf and fmpy, respec- 
tively. 
b. Recode the program using predication instead of the advanced load. 
c. What are the advantages and disadvantages of the two approaches compared with 
each other? 
Assume that a stack register frame is created with size equal to SOF = 48. If the size 
of the local register group is SOL = 16, 
a. How many output registers (SOO) are there? 
b. Which registers are in the local and output register groups? 


GLOSSARY 


absolute address An address in a computer language that identifies a storage location or a device 
without the use of any intermediate reference. 


accumulator The name of the CPU register in a single-address instruction format. The accumulator, 
or AC, is implicitly one of the two operands for the instruction. 


address bus That portion of a system bus used for the transfer of an address. Typically, the address 
identifies a main memory location or an I/O device. 


address space The range of addresses (memory, I/O) that can be referenced. 


arithmetic and logic unit (ALU) A part of a computer that performs arithmetic operations, logic 
operations, and related operations. 


ASCII American Standard Code for Information Interchange. ASCII is a 7-bit code used to repre- 
sent numeric, alphabetic, and special printable characters. It also includes codes for control characters, 
which are not printed or displayed but specify some control function. 


assembly language A computer-oriented language whose instructions are usually in one-to-one cor- 
respondence with computer instructions and that may provide facilities such as the use of macroin- 
structions. Synonymous with computer-dependent language. 


associative memory A memory whose storage locations are identified by their contents, or by a part 
of their contents, rather than by their names or positions. 


asynchronous timing A technique in which the occurrence of one event on a bus follows and depends 
on the occurrence of a previous event. 


autoindexing A form of indexed addressing in which the index register is automatically incremented 
or decremented with each memory reference. 


base In the numeration system commonly used in scientific papers, the number that is raised to the 
power denoted by the exponent and then multiplied by the mantissa to determine the real number rep- 
resented (e.g., the number 10 in the expression 2.7’ 10° = 270). 


base address A numeric value that is used as a reference in the calculation of addresses in the execu- 
tion of a computer program. 


binary operator An operator that represents an operation on two and only two operands. 
bit In the pure binary numeration system, either of the digits 0 and 1. 


block multiplexor channel A multiplexer channel that interleaves blocks of data. See also byte multi- 
plexor channel. Contrast with selector channel. 


branch prediction A mechanism used by the processor to predict the outcome of a program branch 
prior to its execution. 


buffer Storage used to compensate for a difference in rate of flow of data, or time of occurrence of 
events, when transferring data from one device to another. 
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bus A shared communications path consisting of one or a collection of lines. In some computer sys- 
tems, CPU, memory, and I/O components are connected by a common bus. Since the lines are shared 
by all components, only one component at a time can successfully transmit. 


bus arbitration The process of determining which competing bus master will be permitted access to 
the bus. 


bus master A device attached to a bus that is capable of initiating and controlling communication on 
the bus. 


byte A sequence of eight bits. Also referred to as an octet. 


byte multiplexor channel A multiplexer channel that interleaves bytes of data. See also block multi- 
plexor channel. Contrast with selector channel. 


cache A relatively small fast memory interposed between a larger, slower memory and the logic that 
accesses the larger memory. The cache holds recently accessed data, and is designed to speed up subse- 
quent access to the same data. 


cache coherence protocol A mechanism to maintain data validity among multiple caches so 
that every data access will always acquire the most recent version of the contents of a main 
memory word. 


cache line A block of data associated with a cache tag and the unit of transfer between cache and 
memory. 


cache memory A special buffer storage, smaller and faster than main storage, that is used to hold a 
copy of instructions and data in main storage that are likely to be needed next by the processor and 
that have been obtained automatically from main storage. 


CD-ROM Compact Disk Read-Only Memory. A nonerasable disk used for storing computer data. 
The standard system uses 12-cm disks and can hold more than 550 Mbytes. 


central processing unit (CPU) That portion of a computer that fetches and executes instructions. It 
consists of an Arithmetic and Logic Unit (ALU), a control unit, and registers. Often simply referred to 
as a processor. 


cluster A group of interconnected, whole computers working together as a unified computing 
resource that can create the illusion of being one machine. The term whole computer means a system 
that can run on its own, apart from the cluster. 


combinational circuit A logic device whose output values, at any given instant, depend only upon the 
input values at that time. A combinational circuit is a special case of a sequential circuit that does not 
have a storage capability. Synonymous with combinatorial circuit. 


compact disk (CD) A nonerasable disk that stores digitized audio information. 


computer architecture Those attributes of a system visible to a programmer or, put another way, 
those attributes that have a direct impact on the logical execution of a program. Examples of architec- 
tural attributes include the instruction set, the number of bits used to represent various data types 
(e.g., numbers, characters), I/O mechanisms, and techniques for addressing memory. 


computer instruction An instruction that can be recognized by the processing unit of the computer 
for which it is designed. Synonymous with machine instruction. 
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computer instruction set A complete set of the operators of the instructions of a computer together 
with a description of the types of meanings that can be attributed to their operands. Synonymous with 
machine instruction set. 


computer organization Refers to the operational units and their interconnections that realize the 
architectural specifications. Organizational attributes include those hardware details transparent to the 
programmer, such as control signals; interfaces between the computer and peripherals; and the mem- 
ory technology used. 


conditional jump A jump that takes place only when the instruction that specifies it is executed and 
specified conditions are satisfied. Contrast with unconditional jump. 


condition code A code that reflects the result of a previous operation (e.g., arithmetic). A CPU may 
include one or more condition codes, which may be stored separately within the CPU or as part of a 
larger control register. Also known as a flag. 


control bus That portion of a system bus used for the transfer of control signals. 


control registers CPU registers employed to control CPU operation. Most of these registers are not 
user visible. 


control storage A portion of storage that contains microcode. 


control unit That part of the CPU that controls CPU operations, including ALU operations, the 
movement of data within the CPU, and the exchange of data and control signals across external inter- 
faces (e.g., the system bus). 


daisy chain A method of device interconnection for determining interrupt priority by connecting the 
interrupt sources serially. 


data bus That portion of a system bus used for the transfer of data. 
data communication Data transfer between devices. The term generally excludes I/O. 


decoder A device that has a number of input lines of which any number may carry signals and a num- 
ber of output lines of which not more than one may carry a signal, there being a one-to-one correspon- 
dence between the outputs and the combinations of input signals. 


demand paging The transfer of a page from auxiliary storage to real storage at the moment of need. 


direct access The capability to obtain data from a storage device or to enter data into a storage device 
in a sequence independent of their relative position, by means of addresses that indicate the physical 
location of the data. 


direct address An address that designates the storage location of an item of data to be treated as 
operand. Synonymous with one-level address. 


direct memory access (DMA) _ A form of I/O in which a special module, called a DMA module, con- 
trols the exchange of data between main memory and an I/O module. The CPU sends a request for the 
transfer of a block of data to the DMA module and is interrupted only after the entire block has been 
transferred. 


disabled interrupt A condition, usually created by the CPU, during which the CPU will ignore inter- 
rupt request signals of a specified class. 
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diskette A flexible magnetic disk enclosed in a protective container. Synonymous with flexible disk. 


disk pack An assembly of magnetic disks that can be removed as a whole from a disk drive, together 
with a container from which the assembly must be separated when operating. 


disk stripping A type of disk array mapping in which logically contiguous blocks of data, or strips, are 
mapped round-robin to consecutive array members. A set of logically consecutive strips that maps ex- 
actly one strip to each array member is referred to as a stripe. 


dynamic RAM A RAM whose cells are implemented using capacitors. A dynamic RAM will gradu- 
ally lose its data unless it is periodically refreshed. 


emulation The imitation of all or part of one system by another, primarily by hardware, so that the 
imitating system accepts the same data, executes the same programs, and achieves the same results as 
the imitated system. 


enabled interrupt A condition, usually created by the CPU, during which the CPU will respond to in- 
terrupt request signals of a specified class. 


erasable optical disk A disk that uses optical technology but that can be easily erased and rewritten. 
Both 3.25-inch and 5.25-inch disks are in use. A typical capacity is 650 Mbytes. 


error-correcting code A code in which each character or signal conforms to specific rules of con- 
struction so that deviations from these rules indicate the presence of an error and in which some or all 
of the detected errors can be corrected automatically. 


error-detecting code A code in which each character or signal conforms to specific rules of construc- 
tion so that deviations from these rules indicate the presence of an error. 


execute cycle That portion of the instruction cycle during which the CPU performs the operation 
specified by the instruction opcode. 


fetch cycle That portion of the instruction cycle during which the CPU fetches from memory the in- 
struction to be executed. 


firmware Microcode stored in read-only memory. 


fixed-point representation system A radix numeration system in which the radix point is implicitly 
fixed in the series of digit places by some convention upon which agreement has been reached. 


flip-flop A circuit or device containing active elements, capable of assuming either one of two stable 
states at a given time. Synonymous with bistable circuit, toggle. 


floating-point representation system A numeration system in which a real number is represented by 
a pair of distinct numerals, the real number being the product of the fixed-point part, one of the nu- 
merals, and a value obtained by raising the implicit floating-point base to a power denoted by the ex- 
ponent in the floating-point representation, indicated by the second numeral. 


G Prefix meaning 2°°. 


gate An electronic circuit that produces an output signal that is a simple Boolean operation on its 
input signals. 


general-purpose register A register, usually explicitly addressable, within a set of registers, that can 
be used for different purposes, for example, as an accumulator, as an index register, or as a special han- 
dler of data. 
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global variable A variable defined in one portion of a computer program and used in at least one 
other portion of that computer program. 


high-performance computing (HPC) A research area dealing with supercomputers and the software 
that runs on supercomputers. The emphasis is on scientific applications, which may involve heavy use 
of vector and matrix computation, and parallel algorithms. 


immediate address The contents of an address part that contains the value of an operand rather than 
an address. Synonymous with zero-level address. 


indexed address An address that is modified by the content of an index register prior to or during the 
execution of a computer instruction. 


indexing A technique of address modification by means of index registers. 


index register A register whose contents can be used to modify an operand address during the exe- 
cution of computer instructions; it can also be used as a counter. An index register may be used to con- 
trol the execution of a loop, to control the use of an array, as a switch, for table lookup, or as a pointer. 


indirect address An address of a storage location that contains an address. 


indirect cycle That portion of the instruction cycle during which the CPU performs a memory access 
to convert an indirect address into a direct address. 


input-output (I/O) Pertaining to either input or output, or both. Refers to the movement of data be- 
tween a computer and a directly attached peripheral. 


instruction address register A special-purpose register used to hold the address of the next instruc- 
tion to be executed. 


instruction cycle The processing performed by a CPU to execute a single instruction. 


instruction format The layout of a computer instruction as a sequence of bits. The format divides the 
instruction into fields, corresponding to the constituent elements of the instruction (e.g., opcode, 
operands). 


instruction issue The process of initiating instruction execution in the processor’s functional units. 
This occurs when an instruction moves from the decode stage of the pipeline to the first execute stage 
of the pipeline 


instruction register A register that is used to hold an instruction for interpretation. 


integrated circuit (IC) A tiny piece of solid material, such as silicon, upon which is etched or im- 
printed a collection of electronic components and their interconnections. 


interrupt A suspension of a process, such as the execution of a computer program, caused by an event 
external to that process, and performed in such a way that the process can be resumed. Synonymous 
with interruption. 


interrupt cycle That portion of the instruction cycle during which the CPU checks for interrupts. If an 
enabled interrupt is pending, the CPU saves the current program state and resumes processing at an 
interrupt-handler routine. 


interrupt-driven /O A form of I/O. The CPU issues an I/O command, continues to execute subse- 
quent instructions, and is interrupted by the I/O module when the latter has completed its work. 


GLOSSARY 745 


W/O channel A relatively complex I/O module that relieves the CPU of the details of I/O operations. 
An I/O channel will execute a sequence of I/O commands from main memory without the need for 
CPU involvement. 


1/O controller A relatively simple I/O module that requires detailed control from the CPU or an I/O 
channel. Synonymous with device controller. 


V/O module One of the major component types of a computer. It is responsible for the control of one 
or more external devices (peripherals) and for the exchange of data between those devices and main 
memory and/or CPU registers. 


I/O processor An I/O module with its own processor, capable of executing its own specialized I/O in- 
structions or, in some cases, general-purpose machine instructions. 


isolated YO A method of addressing I/O modules and external devices. The I/O address space is 
treated separately from main memory address space. Specific I/O machine instructions must be used. 
Compare memory-mapped I/O. 


k Prefix meaning 2'° = 1024. Thus, 2 kb = 2048 bits. 


local variable A variable that is defined and used only in one specified portion of a computer 
program. 


locality of reference The tendency of a processor to access the same set of memory locations repeti- 
tively over a short period of time. 


M_ Prefix meaning 27” = 1,048,576. Thus, 2 Mb = 2,097,152 bits. 


magnetic disk A flat circular plate with a magnetizable surface layer, on one or both sides of which 
data can be stored. 


magnetic tape A tape with a magnetizable surface layer on which data can be stored by magnetic 
recording. 


mainframe A term originally referring to the cabinet containing the central processor unit or “main 
frame” of a large batch machine. After the emergence of smaller minicomputer designs in the early 
1970s, the traditional larger machines were described as mainframe computers, mainframes. Typical 
characteristics of a mainframe are that it supports a large database, has elaborate I/O hardware, and is 
used in a central data processing facility. 


main memory Program-addressable storage from which instructions and other data can be loaded 
directly into registers for subsequent execution or processing. 


memory address register (MAR) _ A register, in a processing unit, that contains the address of the 
storage location being accessed. 


memory buffer register (MBR)__A register that contains data read from memory or data to be written 
to memory. 


memory cycle time The inverse of the rate at which memory can be accessed. It is the minimum time 
between the response to one access request (read or write) and the response to the next access request. 


memory-mapped I/O A method of addressing I/O modules and external devices. A single address 
space is used for both main memory and I/O addresses, and the same machine instructions are used 
both for memory read/write and for I/O. 
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microcomputer A computer system whose processing unit is a microprocessor. A basic microcomputer 
includes a microprocessor, storage, and an input/output facility, which may or may not be on one chip. 


microinstruction An instruction that controls data flow and sequencing in a processor at a more fun- 
damental level than machine instructions. Individual machine instructions and perhaps other functions 
may be implemented by microprograms. 


micro-operation An elementary CPU operation, performed during one clock pulse. 


microprocessor A processor whose elements have been miniaturized into one or a few integrated 
circuits. 


microprogram A sequence of microinstructions that are in special storage where they can be dynam- 
ically accessed to perform various functions. 


microprogrammed CPU A CPU whose control unit is implemented using microprogramming. 
microprogramming language An instruction set used to specify microprograms. 


multiplexer A combinational circuit that connects multiple inputs to a single output. At any time, 
only one of the inputs is selected to be passed to the output. 


multiplexor channel A channel designed to operate with a number of I/O devices simultaneously. 
Several I/O devices can transfer records at the same time by interleaving items of data. See also byte 
multiplexor channel, block multiplexor channel. 


multiprocessor A computer that has two or more processors that have common access to a main 
storage. 


multiprogramming A mode of operation that provides for the interleaved execution of two or more 
computer programs by a single processor. 


multitasking A mode of operation that provides for the concurrent performance or interleaved exe- 
cution of two or more computer tasks. The same as multiprogramming, using different terminology. 


nonuniform memory access (NUMA) multiprocessor A shared-memory multiprocessor in which the 
access time from a given processor to a word in memory varies with the location of the memory word. 


nonvolatile memory Memory whose contents are stable and do not require a constant power source. 


nucleus That portion of an operating system that contains its basic and most frequently used func- 
tions. Often, the nucleus remains resident in main memory. 


ones complement representation Used to represent binary integers. A positive integer is represented 
as in sign magnitude. A negative integer is represented by reversing each bit in the representation of a 
positive integer of the same magnitude. 


opcode Abbreviated form for operation code. 
operand An entity on which an operation is performed. 


operating system Software that controls the execution of programs and that provides services such as 
resource allocation, scheduling, input/output control, and data management. 


operation code A code used to represent the operations of a computer. Usually abbreviated to 
opcode. 


GLOSSARY 747 


orthogonality A principle by which two variables or dimensions are independent of one another. In 
the context of an instruction set, the term is generally used to indicate that other elements of an in- 
struction (address mode, number of operands, length of operand) are independent of (not determined 
by) opcode. 


page Ina virtual storage system, a fixed-length block that has a virtual address and that is transferred 
as a unit between real storage and auxiliary storage. 


page fault Occurs when the page containing a referenced word is not in main memory. This causes an 
interrupt and requires the operating system to bring in the needed page. 


page frame An area of main storage used to hold a page. 


parity Bit A binary digit appended to a group of binary digits to make the sum of all the digits either 
always odd (odd parity) or always even (even parity). 


peripheral equipment Ina computer system, with respect to a particular processing unit, any equipment 
that provides the processing unit with outside communication. Synonymous with peripheral device. 


pipeline A processor organization in which the processor consists of a number of stages, allowing 
multiple instructions to be executed concurrently. 


predicated execution A mechanism that supports the conditional execution of individual instruc- 
tions. This makes it possible to execute speculatively both branches of a branch instruction and retain 
the results of the branch that is ultimately taken. 


process A program in execution. A process is controlled and scheduled by the operating system. 


process control block The manifestation of a process in an operating system. It is a data structure 
containing information about the characteristics and state of the process. 


processor Ina computer, a functional unit that interprets and executes instructions. A processor con- 
sists of at least an instruction control unit and an arithmetic unit. 


processor cycle time The time required for the shortest well-defined CPU micro-operation. It is the 
basic unit of time for measuring all CPU actions. Synonymous with machine cycle time. 


program counter Instruction address register. 


programmable logic array (PLA) An array of gates whose interconnections can be programmed to 
perform a specific logical function. 


programmable read-only memory (PROM) Semiconductor memory whose contents may be set only 
once. The writing process is performed electrically and may be performed by the user at a time later 
than original chip fabrication. 


programmed I/O A form of I/O in which the CPU issues an I/O command to an I/O module and must 
then wait for the operation to be complete before proceeding. 


program status word (PSW) An area in storage used to indicate the order in which instructions are 
executed, and to hold and indicate the status of the computer system. Synonymous with processor sta- 
tus word. 


random-access memory (RAM) Memory in which each addressable location has a unique addressing 
mechanism. The time to access a given location is independent of the sequence of prior access. 
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read-only memory (ROM) Semiconductor memory whose contents cannot be altered, except by 
destroying the storage unit. Nonerasable memory. 


redundant array of independent disks (RAID) A disk array in which part of the physical storage ca- 
pacity is used to store redundant information about user data stored on the remainder of the storage 
capacity. The redundant information enables regeneration of user data in the event that one of the 
array’s member disks or the access path to it fails. 


registers High-speed memory internal to the CPU. Some registers are user visible; that is, available to 
the programmer via the machine instruction set. Other registers are used only by the CPU, for control 
purposes. 


scalar A quantity characterized by a single value. 


secondary memory Memory located outside the computer system itself; that is, it cannot be 
processed directly by the processor. It must first be copied into main memory. Examples include disk 
and tape. 


selector channel An I/O channel designed to operate with only one I/O device at a time. Once the 
I/O device is selected, a complete record is transferred one byte at a time. Contrast with block multi- 
plexor channel, multiplexor channel. 


semiconductor A solid crystalline substance, such as silicon or germanium, whose electrical conduc- 
tivity is intermediate between insulators and good conductors. Used to fabricate transistors and solid- 
state components. 


sequential circuit A digital logic circuit whose output depends on the current input plus the state of 
the circuit. Sequential circuits thus possess the attribute of memory. 


sign-magnitude representation Used to represent binary integers. In an N-bit word, the leftmost bit 
is the sign (0 = positive, 1 = negative) and the remaining N — 1 bits comprise the magnitude of the 
number. 


solid-state component A component whose operation depends on the control of electric or magnetic 
phenomena in solids (e.g., transistor crystal diode, ferrite core). 


speculative execution The execution of instructions along one path of a branch. If it later turns out 
that this branch was not taken, then the results of the speculative execution are discarded. 


stack An ordered list in which items are appended to and deleted from the same end of the list, 
known as the top. That is, the next item appended to the list is put on the top, and the next item to be 
removed from the list is the item that has been in the list the shortest time. This method is characterized 
as last-in-first-out. 


static RAM A RAM whose cells are implemented using flip-flops. A static RAM will hold its data as 
long as power is supplied to it; no periodic refresh is required. 


superpipelined processor A processor design in which the instruction pipeline consists of many very 
small stages, so that more than one pipeline stage can be executed during one clock cycle and so that a 
large number of instructions may be in the pipeline at the same time. 


superscalar processor A processor design that includes multiple-instruction pipelines, so that more 
than one instruction can be executing in the same pipeline stage simultaneously. 
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symmetric multiprocessing (SMP) A form of multiprocessing that allows the operating system to 
execute on any available processor or on several available processors simultaneously. 


synchronous timing A technique in which the occurrence of events on a bus is determined by a clock. 
The clock defines equal-width time slots, and events begin only at the beginning of a time slot. 


system bus A bus used to interconnect major computer components (CPU, memory, I/O). 


truth table A table that describes a logic function by listing all possible combinations of input values 
and indicating, for each combination, the output value. 


twos complement representation Used to represent binary integers. A positive integer is represented 
as in sign magnitude. A negative number is represented by taking the Boolean complement of each bit 
of the corresponding positive number, then adding | to the resulting bit pattern viewed as an unsigned 
integer. 


unary operator An operator that represents an operation on one and only one operand. 
unconditional jump A jump that takes place whenever the instruction that specified it is executed. 


uniprocessing Sequential execution of instructions by a processing unit, or independent use of a pro- 
cessing unit in a multiprocessing system. 


user-visible registers CPU registers that may be referenced by the programmer. The instruction-set 
format allows one or more registers to be specified as operands or addresses of operands. 


vector A quantity usually characterized by an ordered set of scalars. 


very long instruction word (VLIW) Refers to the use of instructions that contain multiple operations. 
In effect, multiple instructions are contained in a single word. Typically, a VLIW is constructed by the 
compiler, which places operations that may be executed in parallel in the same word. 


virtual storage The storage space that may be regarded as addressable main storage by the user of a 
computer system in which virtual addresses are mapped into real addresses. The size of virtual storage 
is limited by the addressing scheme of the computer system and by the amount of auxiliary storage 
available, and not by the actual number of main storage locations. 


volatile memory A memory in which a constant electrical power source is required to maintain the 
contents of memory. If the power is switched off, the stored information is lost. 


word An ordered set of bytes or bits that is the normal unit in which information may be stored, 
transmitted, or operated on within a given computer. Typically, if a processor has a fixed-length 
instruction set, then the instruction length equals the word length. 
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Intel x86 system, 378 
registers for, 436-437 
Conditional branch instructions, 
23-24, 370-371, 447-448 
Constant angular velocity (CAV), 
187, 206 
Constant linear velocity (CLV), 207 
Continuous-field simulation, 664 
Control, 221-222, 364, 369-374, 588- 
590, 592-596, 619-620 
branch instructions, 370-371 
T/O modules, 222 
logic, 221 
memory, 589-590 
microinstructions, 588-590, 595-596 
operations (opcode), 364, 369-374 
procedure call instructions, 371-374 
skip instructions, 371 
stack implementation, 373-374 
system operations, 364, 369 
TI 8800 SBD microsequencer 
fields, 619-620 
transfer of, 364, 369-374 
Wilkes microprogrammed control, 
592-596 
word, 588 
Control address register (CAR), 598 
Control registers, 435, 437-439, 
464-466 
Control hazards, pipelining, 454 
Control lines, 86 
Control signals, 220, 570-574, 577-578 
control unit (CU), 570-574 
input/output (I/O), 220 
Intel 8085 processor, 575-581 


Control unit (CU), 15, 481, 559-626 
control signals, 570-574, 577-578 
execution, 570, 602-614 
functional requirements, 569-570 
hardwired implementation, 581- 
584, 594 

inputs, 581-538 

Intel 8085 processor, 575-581 

internal processor organization, 
574-575 

logic, 583-584 

microinstructions, 587-590, 596-624 

micro-operations (micro-ops), 
563-569 

microprogrammed, 481, 559-560, 
586-626 

operation, 559, 561-585 

processor control, 15, 569-581 

sequencing, 570, 596-602 

Controllers, 90-91, 232-234, 239-242, 
224 

bus arbiter, 90-91 

channels and, 243 

input/output (I/O), 232-234, 
239-242, 224 

Intel 8237A DMA, 239-242 

Intel 82C59A interrupt, 232-234 

Conversion operations (opcode), 
363, 368-369 

Core, magnetization and, 36 

Cortex-A8 processor, 544-552 
architecture, 544-546 
decode instruction unit, 547-548 
fetch instruction unit, 546-547 
floating-point pipeline, 551-552 
instruction-level parallelism and, 

544-552 
integer execute unit, 548-551 
single-instruction multiple-data 
(SIMD) instructions, 551-552 
superscalar design, 544-552 
Current program status registers 
(CPSR), ARM, 472-474 

Cycle master, 250 

Cycle stealing, 236-237 

Cycle time, pipelining, 450 

Cycles per instruction (CPI), 51-53 

Cylinders, magnetic disks, 190 


Daisy chain technique, I/O, 232 
Data, 11-13, 22, 24, 28, 220, 352-353, 
356-361 

ARM types, 360-361 

channels, 28 

control, 13 

input/output (I/O), 220 

Intel x86 types, 358-360 

machine instruction types, 

352-353, 356-361 

movement of, 12-13 

operands, 356-358 

processing, 11 

storage, 11-12 

transfer instructions, 22, 24 

Data buffering, I/O modules, 223 
Data exchanges, 84-85 
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Data flow, instruction cycles, 441-444 
Data formatting, magnetic disks, 
186-188 
Data hazards, pipelining, 453 
Data (bus) lines, 85-86 
Data processing instruction 
addressing, ARM, 411 
Data registers, 436 
Data transfer, 93-94, 100-102, 362, 
364-365 
bus interconnection, 93-94 
operations (opcode), 362, 364-365 
peripheral component 
interconnection (PCI), 
100-102 
Database scaling, 691-693 
Decode instruction unit, Cortex-A8 
processor, 547-548 
Decoder, 592 
Delayed branch, pipelining, 459, 
501-503 
Delayed load, pipelining, 503 
Demand paging, 283-284 
Denormalized numbers, IEEE 
standards, 342 
Device communication, I/O 
modules, 223 
Digital Equipment Corporation 
(DEC), PDP series computers, 
26, 33-35, 416-419 
Digital versatile disk (DVD), 205, 
208-210 
Dijkstra’s algorithm, 394-395 
Direct access, 113 
Direct addressing mode, 402-404 
Direct mapping, 125-129 
Direct memory access (DMA), 84, 
218, 236-242 
configurations, 237-239 
cycle stealing, 236-237 
direct memory access (DMA), 84, 
218, 236-242 
function, 218, 236-239 
Intel 8237A controller, 239-242 
interconnection structure and, 84 
registers, 240-242 
Directory protocols, 641-642 
Dirty (use) bit, 137 
Discrete components, 28-29 
Disk cache memory, 151 
Disk drive, I/O, 221 
Disks, see Magnetic disks; Optical 
memory systems 
Dispatcher, 272 
Displacement addressing mode, 
402-403, 405-407 
Distributed interrupt controller 
(DIC), 699-701 
Division, 324-327, 337-340 
floating-point numbers, 337-340 
partial remainder, 324-325 
twos complement restoring 
algorithm, 325-327 
Double-data-rate DRAM (DDR 
DRAM), 178-179 
Double-sided disk, 189 
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Drive, Pentium 4 processor, 540, 542 


Dual redundancy disk performance 
(RAID level 6), 196-198, 23 
Dynamic random-access memory 


(DRAM), 39, 160-162, 164-166, 


173-179 
cache (CDRAM), 179 
chip logic, 164-166 


double-data-rate (DDR DRAM), 


178-179 
high-performance processors, 
173-179 
internal main memory, 160-162 
Rambus (RDRAM), 177 
synchronous (SDRAM), 174-177 


EFLAGS register, Intel x86 
processors, 462-464 


Electrically erasable programmable 


read-only memory 
(EEPROM), 160, 163 


Electronic Numerical Integrator and 


Computer (ENIAC), 17-18 
Embedded systems, 46-48 
Encoding, microinstruction 

execution, 606-607 
Endian byte orders, 396-399 
Erasable programmable read-only 

memory (EPROM), 160, 163, 

166-167 

chip packaging, 166-167 

internal main memory, 160, 163 
Error correction, 169-173 

code functions, 169 

Hamming code, 170 

hard failure, 169 

internal memory, 169-173 

parity bits, 170 

semiconductor memory, 169-173 

single error-correcting (SEC) 

code, 173 

soft error, 169 

syndrome words, 170-171 
Error detection, I/O modules, 223 
Exceptions, interrupts and, 467, 

474-475 
Execute cycle, 22, 69-74, 440, 443, 

567-568 

computer instructions, 22, 69-74 

micro-operations (micro-ops), 

567-568 
processor instruction, 440, 443 
Execution, 51-53, 218, 224-226, 274- 

276, 482-487, 530-533, 536-537, 
543-544, 570, 602-614, 646 

control unit (CU), 570, 602-614 

encoding, 606-607 

I/O techniques, 218, 224-226 

IBM 3033 processor, 612-614 

instruction rate, 51-53 

LSI-11 processor, 607-612 

microinstructions, 602-614 

multithreading, 646 

out-of-order, 530-533, 543-544 

process, 274-276, 646 


RISC machine instructions, 482-487 


superscalar programs, 536-537 
taxonomy of, 602-606 
Exponent value, 328, 334 
External memory, 64, 184-216 
direct-access devices, 212 
magnetic disks, 185-194 
magnetic tape, 210-212 
optical systems, 203-210 


Redundant Array of Independent 


Disks (RAID), 185, 194-205 
sequential-access devices, 212 


Failure management, clusters, 656 
Fairness arbitration intervals, 248 
Family concept, 481 
Fetch cycle, 21, 69-74, 440, 443, 
564-566 
computer instructions, 21, 69-74 
micro-operations (micro-ops), 
564-566 
processor instructions, 440, 443 
Fetch instruction unit, Cortex-A8 
processor, 546-547 
Fetch overlap, pipelining, 445, 454 
FireWire serial bus, 246-250 
configurations, 246-248 
cycle master, 250 
link layer, 247-250 
physical layer, 247-248 
transaction layer, 274-248 
Firmware, 588 


First-in first-out (FIFO) algorithm, 136 


Fixed-head disk, 188-189 
Fixed-point notation, see Integers 
Fixed-size partitions, 278-279 
Flash memory, 160, 163 
Floating-point notation, 327-342, 
544, 551-552 
addition, 334-337 
arithmetic and logic unit (ALU) 
data, 327-342 
arithmetic, 334-342 
biased representation, 328 
Cortex-A8 processor pipelining, 
551-552 
denormalized numbers, 342 
division, 337-340 
exponent value, 328, 334 
guard bits, 338-339 
IEEE standards for, 331-333, 
339-342 
infinity interpretation, 341 
multiplication, 337-340 
NaNs, 341 
normalized numbers, 328-329 
overflow, 330, 33 
Pentium 4 execution unit, 544 
precision considerations, 338-340 
principles, 327-331 
representation, 327-333 
rounding, 339-340 
significand, 328, 334 
subtraction, 334-337 
underflow, 330, 334, 341 
Floppy (contact) magnetic disks, 
189, 191 


Frames, I/O memory, 281-282 
Front end, Pentium 4 processor, 
539-542 
Fully nested interrupt mode, 233 
Functional encoding, 607 
Functions, 10-13, 22-23, 28, 65-83, 
104-108, 222-233, 242-243 
components and, 28, 65-83 
computer operation and, 10-13 
execute cycle, 22, 69-74 
fetch cycle, 21, 69-74 
hardwired programs, 67 
T/O channels, 243-244 
T/O modules, 83-84, 222-223, 
242-243 
TAS computer operation, 21-24 
input/output (I/O), 83-84, 222-223, 
242-243 
instruction cycle, 21-24, 69-74, 76-83 
interrupts, 74-83 
software components, 67-68 
von Neuman architecture 
and, 66-68 


Gaps, magnetic disks, 185 
General-purpose registers, 436, 472 
Geometric mean, 55 
Global history buffer (GHB), 547 
Global variable storage, 

registers, 490 
Gradual underflow, 342 
Grant (GNT) signal, PCI, 102-104 
Guard bits, 338-339 


Hamming code, 170 
Hard disk, 191 
Hard failure, 169 
Hardware, 641-642, 685-690 
cache coherence solutions, 
641-642 
multicore computers performance, 
685-690 
parallelism increase, 685-687 
power consumption, 687-690 
Hardware transparency 
approach, 137 
Hardwired control unit (CU), 
581-584, 594 
Hardwired programs, 67 
Harmonic mean, 54-55 
Hash functions, 284-285 
Heads, magnetic disks, 185-186, 
188-189 
High-definition optical disks 
(HD DVD), 210 
High-level language (HLL), 152-153, 
482-487 
operands, 485-486 
operations, 484-485 
performance characteristics, 
152-153 
procedure calls, 486-487 
reduced instruction set computers 
(RISC), 482-487 
semantic gap and, 482-484 
Host channel adapter (HCA), 251 


TAS computer, 18-24 
IBM, see International Business 
Machines (IBM) 
IEEE, see Institute of Electrical and 
Electronics Engineers (IEEE) 
Immediate addressing mode, 402-403 
Immediate constants, ARM, 424-425 
Indexing, 406-407 
Indirect addressing mode, 402-404 
Indirect instruction cycle, 441, 443, 566 
InfiniBand, 250-253 
Infinity, IEEE interpretation, 341 
Infix notation, 393 
Input/Output (I/O), 13, 14, 64, 68, 
83-84, 199-200, 217-258, 364, 369 
address register (I/OAR), 68 
buffer register (I/OBR), 68 
channels, 224, 243-244 
component functions, 28 
computer systems, 64, 68, 217-258 
controllers, 232-234, 239-242, 224 
data movement and, 14 
direct memory access (DMA), 84, 
218, 236-242 
disk drive, 221 
execution techniques, 218, 224-226 
FireWire serial bus, 246-250 
function, 242-243 
high data-transfer capacity, 199 
high request rate, 200 
InfiniBand, 250-253 
Intel 82C55A programmable 
peripheral interface, 234-236 
Intel 82C59A interrupt controller, 
232-234 
Intel 8237A DMA controller, 
239-242 
interconnection structure, 83-84 
interfaces, 218-219, 234-236, 
244-253 
interrupt-driven, 218, 228-236 
keyboard/monitor arrangement, 221 
modules, 83-84, 218-219, 222-224, 
242-243 
multipoint interfaces, 245-246 
operations (opcode), 364, 369 
peripheral (external) devices, 
219-221 
peripheral data devices, 13 
point-to-point interfaces, 245 
programmed, 218, 224-228, 
234-236 
RAID 0 performance for, 199-200 
T/O channels, 224, 243-244 
I/O modules, 83-84, 218-219, 222-224, 
242-243 
computer functions and, 83 
control and timing 
requirements, 222 
data buffering, 223 
device communication, 223 
error detection, 223 
evolution of, 242-243 
function, 83-84, 222-223 
input/output interfaces and, 
218-219 


interconnection structure, 83-84 
processor communication, 
84, 222-223 
structure, 223-224 
Institute of Electrical and 
Electronics Engineers (IEEE), 
3-4, 331-333, 339-342 
denormalized number 
standards, 342 
floating-point notation standards, 
331-333, 339-342 
infinity interpretation, 341 
Joint Task Force publications, 3-4 
NaN standards, 341 
rounding approaches, 339-340 
Instruction buffer register (IBR), 21 
Instruction cycle, 21-24, 69-83, 
440-444, 568-569. See also 
Direct memory access (DMA) 
code (ICC), 568-569 
computer functions, 69-83 
data flow, 441-444 
execute cycle, 22, 69-74, 440, 443 
fetch cycle, 21, 69-74, 440, 443 
TAS computer, 21-24 
indirect cycle, 441, 443 
1/O modules, 83 
interrupt cycle, 76-78, 440, 444 
interrupts and, 74-83 
micro-operations (micro-ops), 
568-569 
multiple interrupts, 78-83 
processor, 440-444 
state diagrams, 73, 80, 442 
Instruction execution rate, 51-53 
Instruction formats, 304, 413-428, 
497-498, 507, 515-517 
Advanced RISC Machine (ARM), 
424-426 
assembly language, 426-428 
bit allocation, 414-418 
Intel x86, 421-424 
length, 413-414 
MIPS R4000 microprocessor, 507 
PDP-8 design, 416-417 
PDP-11 design, 418-419 
PDP-10 design, 417-418 
reduced instruction set computers 
(RISC), 497-498, 507, 515-517 
Scalable Processor Architecture 
(SPARC), 515-517 
variable-length, 418-421 
VAX design, 420-421 
Instruction-level parallelism, 304, 
522-557 
Advanced RISC Machine (ARM) 
Cortex-A8 processor, 544-552 
antidependency, 532-533 
branch prediction, 536 
degree of instruction execution 
and, 526-528 
execution of superscalar programs, 
536-537 
implementation of superscalar 
programs, 537 
instruction issue policy, 529-533 
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Intel Pentium 4 processor, 538-544 
machine parallelism and, 528-529, 
534-536 
output dependency, 530-532 
procedural dependency, 528 
register renaming, 533-534, 543 
resource conflict, 528 
superscalar processors and, 304, 
483, 522-557 
true data (flow) dependency, 
526-528 
Instruction prefetch (fetch overlap), 
445, 454 
Instruction register (IR), 21, 
70, 438 
Instruction sets, 52, 303-304, 348-431, 
505-507, 513-515, 674-676 
addressing modes, 304, 400-413 
Advanced RISC Machine (ARM), 
360-361, 381-384, 411-413, 
424-426 
architecture, 52 
assembly language, 426-428 
central processing unit (CPU) 
functions, 303-304, 348-431 
data types, 358-361 
design, 355-356 
endian byte orders, 396-399 
IBM 3090 vector facility ALU, 
674-676 
instruction formats, 304, 413-428 
Intel x86, 358-360, 374-381, 408- 
410, 421-424 
machine instructions, 348-390 
MIPS R4000 microprocessor, 
505-507 
operands, 303, 350-351, 356-358 
operations (opcode), 303, 350, 
362-384 
reduced instruction set computers 
(RISC), 505-507, 513-515 
Scalable Processor Architecture 
(SPARC), 513-515 
stacks, 390-395 
Instruction window, 532 
Instructions, see Machine 
instructions, Micro-operations 
Integers, 307-327, 544, 548-551 
addition, 314-317 
arithmetic and logic unit (ALU) 
data, 307-327 
arithmetic, 312-327 
converting between bit lengths, 
310-312 
Cortex-A8 processor execute unit, 
548-551 
division, 324-327 
fixed-point, 312 
multiplication, 317-324 
negation, 313-314 
overflow, 314-315 
Pentium 4 processor execution 
unit, 544 
representation, 307-312 
sign magnitude, 308 
subtraction, 314-317 


768 INDEX 


Integers (continued) 
twos complement, 308-310, 
312-327 
unsigned multiplication, 317-319 
Integrated circuits, development of, 
28-34 
Intel x86 system, 2, 44-46, 232-236, 
239-242, 288-293, 358-360, 
374-381, 408-410, 421-424, 
439-440, 460-469, 538-544, 
575-581, 651, 696-699 
addressing mode, 408-410 
cache memory, 140-143 
call/return instructions, 377 
chip multiprocessing, 651 
condition codes, 378 
control register, 464-466 
Core Duo, 696-698 
Core 17, 698-699 
CPU instruction sets, 358-360, 
374-381, 408-410, 421-424 
data types, 358-360 
direct memory access (DMA) and, 
239-242 
EFLAGS register, 462-464 
82C55A programmable peripheral 
interface, 234-236 
82C59A interrupt controller, 
232-234 
8237A DMA controller, 239-242 
8085 processor control unit (CU), 
575-581 
8086 microprocessor registers, 
439-440 
80486 information pipelining, 
460-461 
80386 microprocessor registers, 
439-440 
evolution of, 44-46 
external signals, 577-578 
I/O memory management, 
288-293 
instruction format, 421-424 
instruction-level parallelism and, 
538-544 
interrupt-driven I/O and, 232-236 
interrupt processing, 467-469 
machine instructions, 358-360, 
374-381 
memory management 
instructions, 377 
MMxX (mutimedia task) 
instructions, 378-381 
MMxX registers, 466-467 
multicore computer organization, 
696-699 
operations (opcode), 374-381 
Pentium 4 processor, 140-143, 
538-544, 651 
Pentium II processor, 288-293 
processor organization, 461-469 
programmable I/O and, 234-236 
register organization, 461-467 
single-instruction multiple-data 
(SIMD) instructions, 378-381 
status flags, 378 


superscalar processor design, 
538-544 
timing operation, 578-581 
Interactive operating system 
(OS), 264 
Interconnections, 14-15, 63, 83-108, 639 
bus, 15, 85-94 
computer structure and, 14-15, 63 
data exchanges, 84-85 
T/O modules, 83-84 
memory modules, 83 
peripheral component (PCI), 95-104 
processor signals, 84 
switched, SMP, 639 
Interfaces, 218-219, 234-236, 244-253 
external I/O, 244-253 
FireWire serial bus, 246-250 
InfiniBand, 250-253 
input/output (I/O), 218-219, 
234-236, 244-253 
T/O modules, 218-219 
Intel 82C55A programmable 
peripheral, 234-236 
multipoint, 245-246 
parallel I/O, 244-245 
point-to-point, 245 
serial I/O, 244-245 
Interleaved memory, 168 
Interleaved multithreading, 647-650 
Intermediate queues, 277-278 
Internal memory, 64, 158-183 
chips, 164-168 
dynamic random-access memory 
(DRAM), 160-162, 164-166, 
173-179 
electrically erasable programmable 
read-only memory 
(EEPROM), 160, 163 
erasable programmable read-only 
memory (EPROM), 160, 163, 
166-167 
error correction, 169-173 
flash memory, 160, 163 
high-level performance, 173-179 
interleaved, 168 
main (cell), 159-168 
programmable read-only memory 
(PROM), 160, 163 
random-access memory (RAM), 
160-161 
read-only memory (ROM), 160, 
162-163 
semiconductors, 159-183 
static random-access memory 
(SRAM), 162 
International Business Machines 
(IBM), 25-28, 31-33, 600-601, 
612-614, 637-640, 651-653, 
670-676 
address generation sequencing, 
600-601 
ALU instruction set, 674-676 
compound instruction 
execution, 674 
PowerS chip multiprocessing, 
651-653 


register-to-register organization, 
671-647 
700/7000 series computers, 25-28 
360 series computer, 31-33 
3033 processor microinstructions, 
600-601, 612-614 
3090 vector facility, 670-676 
z990 SMP mainframes, 637-640 
International Reference Alphabet 
(IRA), 221 
Interrecord gaps, 211 
Interrupt cycle, 74-83, 80, 440, 444, 
566-567 
computer instructions, 76-78, 80 
micro-operations (micro-ops), 
566-567 
processor instructions, 440, 444 
Interrupt-driven I/O, 218, 224-226, 
228-236 
bus arbitration technique, 232 
daisy chain technique, 232 
design and implementation of, 
230-232 
drawbacks of, 236 
execution, 218, 224-226 
Intel 82C55A programmable 
peripheral interface, 234-236 
Intel 82C59A interrupt controller, 
232-234 
multiple interrupt lines, 231 
interrupt processing, 228-230 
programmed I/O and, 224-226, 
234-236 
software poll technique, 232 
Interrupt service routine (ISR), 82-83 
Interrupts, 74-83, 228-232, 233-234, 
267, 275, 467-469, 474-475, 697, 
699-703 
advanced programmable interrupt 
controller (APIC), 697 
Advanced RISC Machine (ARM) 
processing, 474-475 
ARM11 MPCore, 699-703 
disabled, 79 
distributed interrupt controller 
(DIC), 699-701 
exceptions and, 467, 474-475 
fully nested mode, 233 
handling, 77, 469, 700-703 
instruction cycle and, 74-83 
Intel 82C59A modes, 232-234 
Intel x86 processing, 467-469, 697 
multicore computers, 697, 699-703 
multiple, 78-83, 230-232 
operating system (OS) 
hardware, 267 
processing, 228-230 
program flow of control and, 74-76 
request signal, 76 
rotating mode, 234 
scheduling process, 275 
special mask mode, 234 
vector tables, 467-468, 474 
vectored, 232 
Isochronous data transmission, 248-249 
Isolated I/O, 227 


Job, operating system (OS), 264 
Job control language (JCL), 266 


Kernel (nucleus), 263 
Keyboard arrangement, I/O, 221 


Lands, compact disks, 206 
Large-scale integration (LSI), 34 
Last-in-first-out (LIFO) queue, 
390-392, 407 
Leading edge, 108-109 
Least-frequently used (LFU) 
algorithm, 136 
Least-recently used (LRU) 
algorithm, 136, 283 
Linear tape-open (LTO) system, 212 
Lines, cache memory, 119-110, 138 
Link layer, 247-250, 253 
Links, InfiniBand, 251 
Little endian ordering, 396-399 
Load balancing, clusters, 656 
Load/store addressing, ARM, 411 
Load/store multiple addressing, 
ARM, 412-413 
Locality of reference, 117, 151-153 
Logic (Boolean) instructions, 353 
Logic-memory performance 
balance, 40-42 
Logical address, 281, 282 
Logical data operands, 358 
Logical operations (opcode), 363, 
366-368 
Logical shift, 366-367 
Long-term scheduling, 271-272 
Loop buffer, pipelining, 454-455 
Loop unrolling, pipelining, 503-504 
LSI-11 processor, 601-602, 607-612 
control unit (CU) organization, 
609-611 
execution, 607-612 
microinstruction format, 611-612 
microinstructions, 601-602, 607-612 
sequencing, 601-602 


Machine cycles, 578-581 
Machine instructions, 348-390, 
482-487, 496 

addresses, 353-355 

Advanced RISC Machine (ARM), 
360-361, 381-384 

arithmetic, 353 

branch, 353, 370-371 

data types, 352-353, 356-361 

elements of, 350-351 

high-level languages (HLL) and, 
482-. 

instruction set design, 355-356 

Intel x86, 358-360, 374-381 

logic (Boolean), 353 

memory, 353 

operands, 350-351, 356-358, 
485-486 

operations (opcode), 350, 362-384, 
484-485 

procedure calls, 371-374, 377, 
486-487 


reduced instruction set computers 
(RISC), 482-487, 496 
RISC execution, 482-487 
symbolic representation, 351-352 
test, 353 
Machine parallelism, 528-529, 534-536 
Magnetic disks, 185-194 
constant angular velocity (CAV), 
187-188 
cylinders, 190 
data formatting, 186-188 
floppy (contact), 189, 191 
heads, 185-186, 188-189 
multiple platters, 190 
multiple zone recording, 188 
parameters, 191-194 
read mechanisms, 185-186 
rotational delay (latency), 192-193 
rotational positional sensing 
(RPS), 192 
seek time, 192-193 
sequential organization, 193-194 
single and double sides, 189 
tracks, 185, 190-191 
transfer time, 192-193 
Winchester format, 189, 191 
write mechanisms, 185-186 
Magnetic tape, 210-212 
Magnetoresistive sensor, 186 
Main memory, 14, 68, 122-123, 151, 
159-161, 262-264 
cache (physical), 122-123, 151 
computer component of, 14, 68 
internal (cell), 159-168 
kernel (nucleus), 263 
OS resource management, 262-264 
Mantissa, 328 
Mapping functions, 124-135 
associative, 129-131 
cache memory, 124-135 
direct, 125-129 
set-associative, 131-135 
Medium-term scheduling, 272 
Memory address register (MAR), 
21, 68, 72, 438 
Memory bank, 168 
Memory buffer register (MBR), 21, 
68, 72, 438 
Memory cycle time, 26-28, 113 
Memory instructions, 353 
Memory management, 260, 270, 
277-298, 377 
access control, 298 
addresses, 280-281, 288-289, 294-295 
Advanced RISC Machine (ARM), 
293-298 
compaction, 280 
formats, 290, 295-298 
input/output (I/O), 260, 270, 
277-298 
Intel Pentium II processor, 288-293 
Intel x86 machine instructions, 377 
multiprogramming and, 270, 277 
operating systems (OS), 260, 270, 
277-298 
paging, 281-285, 291-293 
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parameters, 292, 297 
partitioning, 278-281 
segmentation, 287-288, 289-291 
swapping, 277-278 
translation lookaside buffer 
(TLB), 285-287, 293-294 
virtual memory, 283-285, 294-295 
Memory management unit (MMU), 
122-123, 294-295 
Memory-mapped I/O, 227 
Memory modules, 83 
Memory protection, OS, 267 
Memory systems, 110-216 
access, 113, 116 
addressable units, 112 
cache, 110-157 
capacity, 112 
external, 184-216 
hierarchy, 114-118 
hit, 116 
internal, 158-183 
locality of reference, 117, 151-153 
location, 111-112 
miss, 116 
organization, 114 
performance, 113-114, 116, 151-157 
physical characteristics of, 114 
secondary (auxiliary), 117 
two-level, 151-157 
unit of transfer, 112-113 
word, 112 
MESI (modified, exclusive, shared, 
or invalid) protocol, 642-645 
Microelectronics, development of, 
29-31 
Microinstruction bus (MIB), 609-610 
Microinstructions, 587-590, 596-624 
control memory, 589-590 
encoding, 606-607 
execution, 602-614 
formats, 611-612, 615-617 
horizontal, 589, 605-606 
IBM 3033 processor, 600-601, 
612-614 
LSI-11 processor, 601-602, 607-612 
sequencing, 596-602, 617-620 
taxonomy of, 602-606 
TI 8800 Software Development 
Board (SBD), 614-624 
vertical, 592, 605-606 
Wilkes control, 595-596 
Micro-operations (micro-ops), 
538-540, 543-544, 563-569 
allocation, 543 
control unit (CU), 563-569 
execute cycle, 567-568 
fetch cycle, 564-566 
front end generation of, 539-540 
indirect cycle, 566 
instruction cycle, 568-569 
interrupt cycle, 566-567 
Pentium 4 processor, 538-540, 
543-544 
queuing, 544 
scheduling and dispatching, 544 
superscalar processors, 538-540 
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Microprocessors, 36-38, 39-40, 439-440 
development of, 36-38 
Intel 80386 registers, 439-440 
Intel 8086 registers, 439-440 
Motorola MC68000 registers, 
439-440 
register organizations, 439-440 
speed (performance of), 39-40 
Microprogrammed control units, 
481, 559-560, 586-626 
advantages and disadvantages of, 
594, 496 
control memory, 589-590 
decoder, 592 
development of, 587 
execution, 602-614 
firmware, 588 
IBM 3033 processor, 612614 
LSI-11 processor, 601-602, 607-612 
microarchitecture, 590-592 
microinstructions, 587-590, 596-624 
sequencing, 596-602 
TI 8800 Software Development 
Board (SBD), 614-624 
Wilkes control, 592-596 
Microprogramming language, 588 
Migratory lines, 703-704 
Millions of floating-point operations 
per second (MFLOPS) rate, 
52-53 
Millions of instructions per second 
(MIPS) rate, 52-53 
MIPS R4000 microprocessor, 
504-511 
instruction format, 507 
instruction set, 505-507 
pipelining instructions, 507-511 
Mirrored disk performance (RAID 
level 1), 196-197, 200-201 
MMxX (mutimedia task), Intel x86 
processors, 387-381, 466-467 
instructions, 378-381 
registers, 466-467 
Mnemonics, 351 
Monitor (simple batch OS), 265-267 
Monitor arrangement, I/O, 221 
Moore’s law, 30-31 
Motorola MC68000 microprocessor 
registers, 439-440 
Movable-head disk, 189 
Multicore computers, 627, 684-706 
ARM11 MPCore, 699-704 
chip multiprocessors as, 685 
database scaling, 691-693 
hardware performance, 685-690 
Intel Core Duo, 696-698 
Intel Core 17, 698-699 
Intel x86 organization, 696-699 
organization, 694-696 
parallelism increase, 685-687 
power consumption, 687-690 
software performance, 690-694 
speedup time increase, 690-691 
threading, 693-694 
Multicore processors, 44 
Multilevel cache memory, 138-138 


Multiple instruction, multiple data 
(MIMD) stream, 630-632 
Multiple instruction, single data 
(MISD) stream, 630-632 
Multiple interrupt lines, I/O, 231 
Multiple parallel processing, 669-670 
Multiple platters, magnetic disks, 190 
Multiple streams, pipelining, 454 
Multiple zone recording, 188 
Multiplexor, 28 
Multiplexor channel, 243-244 
Multiplication, 317-324, 337-340 
Booth’s algorithm, 322-324 
floating-point numbers, 337-340 
twos complement, 319-324 
unsigned integers, 317-319 
Multiplier quotient (MQ), 21 
Multipoint interfaces, 245-246 
Multiprocessor OS design, SMP 
considerations for, 636-637 
Multiprogramming operating system 
(OS), 264, 267-270, 277 
batches, 267-270 
memory management and, 270, 277 
uniprogramming compared to, 
264, 270 
Multitasking, operating systems 
(OS), 268 
Multithreading, 629, 646-653 
chip multiprocessing, 629, 651-653 
explicit, 647-651 
implicit, 646-647 
parallel processing, 629, 646-653 
process, 646-647 
switches, 647 
thread, 647 


NaNs, IEEE standards, 341 
Negation, integers, 313-314 
Network layer, 253 
Noncacheable memory approach, 137 
Nonredundant disk performance 
(RAID level 0), 196-200 
Nonremovable disk, 189 
Nonuniform memory access 
(NUMA), 627, 629-631, 660-663 
advantages and disadvantages 
of, 663 
cache-coherent (CC-NUMA), 
660-663 
motivation, 660-661 
organizations, 661-663 
parallel processor architecture, 
629-631 
uniform memory access 
(UMA), 660 
Normalized numbers, 328-329 
Numerical data operands,356-357 


Offset addressing, ARM, 411 
Opcode, see Operations (opcode) 
Operands, 303, 350-351, 356-358, 
485-486 
characters, 357-358 
high-level language (HLL), 485-486 
logical data, 358 


machine instructions, 350-351 
numbers, 356-357 
packed decimal representation, 
357-358 
reduced instruction set computers 
(RISC), 485-486 
Operating system (OS), 64, 259-302 
Advanced RISC Machine (ARM) 
memory management, 
293-298 
batch, 264, 265-270 
computer system support, 64, 
259-302 
evolution of, 264-265 
functions, 260-271 
Intel Pentium II memory 
management, 288-293 
interactive, 264 
interrupts, 267 
memory management, 260, 270, 
277-298 
memory protection, 267 
multiprogramming, 264, 267-270 
objectives, 260-261 
privileged instructions, 267 
resource management, 262-264, 
269-270 
scheduling, 260, 264, 271-277 
setup time, 264-265 
time-sharing, 267, 270-271 
uniprogramming, 264, 27 
user/computer interfacing, 261-262 
utilities, 261-262 
Operations (opcode), 20, 24, 303, 
350, 362-384, 484-485 
Advanced RISC Machine (ARM), 
381-384 
arithmetic, 362, 365-366 
computer instructions, 20, 24 
conversion, 363, 368-369 
data transfer, 362, 364-365 
high-level language (HLL), 484-485 
input/output (I/O), 364, 369 
Intel x86, 374-381 
logical, 363, 366-368 
machine instructions, 350, 362-384 
reduced instruction set computers 
(RISC), 484-485 
system control, 364, 369 
transfer of control, 364, 369-374 
Optical memory systems, 203-210 
Blu-ray DVD, 205, 210 
compact disk (CD), 203, 205-209 
digital versatile disk (DVD), 205, 
208-210 
high-definition optical disks (HD 
DVD), 210 
types of, 205 
Original equipment manufacturers 
(OEM), 34 
Orthogonality, 418 
Out-of-order execution, 530-533, 
543-544 
Output dependency, parallelism, 
530-532 
Overflow, 314-315, 330, 334 


Packed decimal representation, 
357-358 
Page fault, 283 
Page tables, 282, 284-285, 294-295 
Pages, I/O memory, 281-282 
Paging, 281-285, 291-293 
demand, 283-284 
frame allocation, 281-282 
I/O memory management, 
281-285, 291-293 
page replacement, 283-284 
page tables, 282, 284-285 
Pentium II processor, 291-293 
virtual memory, 283-285 
Parallel I/O interfaces, 244-245 
Parallel organization, 627-706 
cache coherence, 629, 640-645 
chip multiprocessing, 629, 648-653 
clusters, 629, 653-660 
multicore computers, 627, 684-706 
multiple processor organizations, 
630-631 
multithreading, 629, 646-653 
nonuniform memory access 
(NUMA), 627, 629-631, 
660-663 
parallel processing, 627-683 
symmetric multiprocessors (SMP), 
627, 629-640, 659-660 
vector computation, 664-676 
Parallel recording, 211 
Parallelism, 304, 522-557, 656, 
685-687 
cluster applications, 656 
instruction issue policy, 529-533 
instruction-level, 304, 522-557 
limitations, 526-528, 530-533 
machine, 528-529, 534-536 
multicore computer increase, 
685-687 
Parameters, magnetic disks, 191-194 
Parametric computing, 657 
Parity bits, 170 
Partial remainder, 324-325 
Partitioning, I/O memory 
management, 278-281 
Passive standby clustering method, 
654-655 
PCL, see Peripheral component 
interconnection (PCI) 
PDP-8 instruction format design, 
416-417 
PDP-11 instruction format design, 
418-419 
PDP-10 instruction format design, 
417-418 
Pentium 4 processor, 140-143, 
538-544, 651 
allocation, 543 
chip multiprocessing, 651 
drive, 540, 542 
floating-point execution unit, 544 
front end, 539-542 
instruction-level parallelism and, 
538-544 
integer execution unit, 544 


micro-operations (micro-ops), 
538-540, 543-544 
organization, 140-143 
out-of-order execution logic, 
543-544 
register renaming, 543 
superscalar design, 538-544 
trace cache fetch, 540, 542 
trace cache net instruction pointer, 
540-542 
Pentium II processor, 288-293 
address spaces, 288-289 
formats for memory 
management, 290 
I/O memory management, 288-293 
paging, 291-293 
parameters for memory 
management, 292 
segmentation, 289-291 
virtual address fields, 291 
Peripheral component 
interconnection (PCI), 95-104 
arbitration, 102-104 
bus interconnection structure, 95-98 
commands, 99-100 
configuration, 95-96 
data transfers, 100-102 
grant (GNT) signal, 102-104 
request (REQ) signal, 102-104 
signal lines, 97-98 
special interest group (SIG), 95 
Peripheral (external) devices, I/O, 
219-221 
Phase change, 208 
Physical address, 281, 282 
Physical dedication, 90 
Physical layer, 247-248, 253 
Pipelining, 444-461, 481, 500-504, 
507-511, 525-526, 551-552, 
666-669 
branch prediction, 455-459 
branches and, 454-459 
bubble, 451 
Cortex-A8 processor, 551-552 
cycle time, 450 
delayed branch, 459, 501-503 
delayed load, 503 
development of, 481 
floating-point instructions, 
551-552, 667-669 
hazards, 451-454 
instruction prefetch (fetch 
overlap), 445, 454 
Intel 80486 processor, 460-461 
loop buffer, 454-455 
loop unrolling, 503-504 
MIPS R4000 microprocessor, 
507-511 
multiple streams, 454 
optimization, 501-504 
performance, 450-451 
processor instructions, 444-461 
RISC instructions, 500-504, 
507-511 
single-instruction multiple-data 
(SIMD) instructions, 551-552 
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speedup factor, 450-451 
strategy, 444-449 
superpipelined approach, 525-526 
superscalar approach compared 
to, 525-526 
vector computations and, 666-669 
Pits, compact disks, 206 
Platters, 185, 189-190 
Point-to-point interfaces, 245 
Pollack’s rule, 690 
POP stack operation, 391 
Postfix notation, 393-394 
Postindexing, 407, 411 
Power consumption, 687-690 
Power density, 42-43 
Power management logic, 697-698 
Preindexing, 407, 411 
Privileged instructions, 267 
Procedural dependency, 
parallelism, 528 
Procedure calls, 371-374, 377, 486-487 
control transfer instructions, 
371-374 
high-level language (HLL), 
486- 
Intel x86 call/return 
instructions, 377 
reduced instruction set computers 
(RISC), 486-487 
stack implementation of, 373-374 
Process, 271-276, 646-647 
concept of, 271 
control block, 273-274 
execution, 274-276, 646 
interrupt, 275 
multithreading, 646-647 
resource ownership, 646 
scheduling, 271-276, 646 
states, 272-274 
switch, 647 
Processors, 14-15, 84, 222-223, 304, 
423-479 
Advanced RISC Machine (ARM) 
organization, 469-475 
arithmetic and logic unit (ALU), 
15, 434-435 
communication, 84, 222-223 
control signals, 570-574, 577-578 
control unit (CU), 15, 569-581 
functional requirements, 569-570 
T/O modules, 84, 222-223 
instruction cycle, 440-444 
Intel 8085, 575-581 
Intel x86 organization, 461-469 
internal organization, 574-575 
interrupt processing, 467-469, 
474-475 
modes, ARM, 471-472 
pipelining instructions, 444-461 
registers, 15, 434-440, 461-467, 
472-474 
requirements of, 433-435 
signals, 84 
structure and function, 304, 432-479 
system interconnection (bus), 
14-15, 84, 434-435 
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Program counter (PC), 21, 69-70, 438 
Program status word (PSW), 438 
Programmable read-only memory 
(PROM), 160, 163 
Programmed I/O, 218, 224-228, 
234-236 
commands, 225-226 
drawbacks of, 236 
execution, 218, 224-226 
instructions, 226-228 
Intel 82C55A programmable 
peripheral interface, 234-236 
interrupt-driven I/O and, 224-226, 
234-236 
isolated, 227 
memory-mapped, 227 
PUSH stack operation, 391 


Queues, 276-278, 544 

V/O, 276-277 

intermediate, 277-278 

long- and short-term, 276 

memory management swapping, 

277-278 

micro-operations (micro-ops), 544 

processor scheduling, 276-277 
Quick Path Interconnect (QPI), 699 
Quiet NaN, 341 


Radix point, 307 
RAID, see Redundant Array of 
Independent Disks (RAID) 
Rambus DRAM (RDRAM), 177 
Random access, 113 
Random-access memory (RAM), 
160-161 
Rate metric measures, 55-56 
Read hit/miss, 644 
Read mechanisms, magnetic disks, 
185-186 
Real memory, 284 
Read-only memory (ROM), 160, 
162-163 
Read-with-intent-to-modify 
(RWITM), 644-645 
Reduced instruction set computers 
(RISC), 2, 304, 480-521 
addressing mode simplicity, 
496-497 
architecture, 494-500 
CISC and superscalar systems 
compared to, 483 
compiler-based register 
optimization, 492-494 
complex instruction set computer 
(CISC) architecture 
compared to, 498-500, 517- 
518 
development of, 482 
high-level language (HLL) and, 
482- 
instruction execution, 482-487 
instruction formats, 497-498, 507, 
515-517 
instruction sets, 505-507, 513-515 
machine cycle instructions, 496 


MIPS R4000 microprocessor, 
504-511 
operands, 485-486 
operations, 484-485 
pipelining instructions, 500- 504, 
507-511 
procedure calls, 486-487 
register-to-register characteristics, 
496-497 
registers, 487-494, 511-513 
Scalable Processor Architecture 
(SPARC), 511-517 
Redundant Array of Independent 
Disks (RAID), 185, 194-205 
bit-interleaved parity (level 3), 
196-198, 201-202 
block-level distributed parity 
(level 5), 196-198, 203 
block-level parity (level 4), 
196-198, 202-203 
characteristics of, 194-197 
dual redundancy (level 6), 196- 
198, 23 
Hamming code, redundant via 
(level 2), 196-197, 201 
levels, 195-197, 204-205 
mirrored (level 1), 196-197, 200-201 
nonredundant (level 0), 196-200 
redundancy, 3, 201-202 
striping (level 0), 196-200 
Redundant disk performance via 
Hamming code (RAID level 2), 
196-197, 201 
Register renaming, 533-534, 543 
Register-to-register organization, 
496-497, 671-647 
Registers, 15, 20-21, 402-405, 434- 
440, 461-467, 472-474, 487-494, 
511-513, 619, 671-674 
address, 436 
addressing mode, 402-405 
Advanced RISC Machine (ARM) 
organization, 472-474 
cache memory compared to, 
490-492 
compiler-based optimization, 
492-494 
condition codes (flags), 436-437 
control, 435, 437-439, 464-466 
current program status (CPSR), 
472-474 
data, 436 
EFLAGS, 462-464 
general-purpose, 436, 472 
global variable storage, 490 
IAS computer memory and, 20-21 
IBM 3090 vector facility, 671-674 
indirect addressing mode, 
402-403, 405 
instruction (IR), 21, 438 
instruction buffer (IBR), 21, 438 
Intel 80386 microprocessor, 
439-440 
Intel 8086 microprocessor, 439-440 
Intel x86 organization, 461-467 
larger file approaches, 487-492 


memory address (MAR), 21, 438 
memory buffer (MBR), 21, 438 
microprocessor organizations, 
439-440 
MMxX, 466-467 
Motorola MC68000 
microprocessor, 439-440 
program counter (PC), 21, 438 
program status word (PSW), 438 
reduced instruction set computers 
(RISC), 487-494, 511-513 
registers, 15, 434-440 
Scalable Processor Architecture 
(SPARC), 511-513 
status, 435, 437-439 
TI 8800 SBD microsequencer, 619 
user-visible, 435-437 
windows, 488-490, 511-513 
Relative address, 282, 406 
Removable disk, 189 
Replacement algorithms, cache 
memory, 136 
Request (REQ) signal, PCI, 102-104 
Resident monitor, 265 
Residual control, 601 
Resistive-capacitive (RC) delay, 43 
Resource conflict, parallelism, 528 
Resource encoding, 607 
Resource hazards, pipelining, 451-452 
Resource management, OS, 262-264, 
269-270 
Resource ownership process, 646 
Reverse Polish notation, 393-394 
RISC, see Reduced instruction set 
computers (RISC) 
Rotate (cyclic shift) operation, 368 
Rotating interrupt mode, 234 
Rotational delay (latency), magnetic 
disks, 192-193 
Rotational positional sensing 
(RPS), 192 
Rounding, IEEE standards, 339-340 
Router, InfiniBand, 251 


Saturation arithmetic, 387 
Scalable Processor Architecture 
(SPARC), 511-517 
instruction format, 515-517 
instruction set, 513-515 
register set, 511-513 
Scheduling, 260, 264-265, 271-277, 
544, 646 
efficiency of, 264-265 
interrupt process, 275 
long-term, 271-272 
medium-term, 272 
micro-operations (micro-ops), 544 
multithreading, 646 
operating system (OS) function, 
260, 264, 271-277 
process, 271-274, 646 
queues, 276-277 
short-term, 272-277 
state of a process, 272-274 
techniques, 274-277 
Secondary (auxiliary) memory, 117 


Sectors, magnetic disks, 185 
Seek time, magnetic disks, 192-193 
Segmentation, Pentium IT processor, 
287-291 
Selector channel, 243-244 
Semantic gap, 482-484 
Semiconductors, 34-36, 159-183. See 
also Internal memory 
Sequencing, 570, 596-602, 617-620 
address generation, 600-601 
branch control logic techniques, 
597-600 
control unit (CU), 570, 596-602 
design considerations, 596-597 
IBM 3033 processor, 600-601 
LSI-11 processor, 601-602 
microinstructions, 596-602 
TI 8800 SDB microsequencer, 
617-620 
Sequential access, 113 
Sequential organization, magnetic 
disks, 193-194 
Serial I/O interfaces, 244-245 
Serial recording, 211 
Serpentine recording, 211-212 
Server clustering approaches, 622 
Set-associative mapping, 131-135 
Setup time, operating system (OS) 
efficiency, 264-265 
Short-term scheduling, 272-277 
Sign-magnitude representation, 308 
Signal lines, PCI, 97-98 
Signaling NaN, 341 
Significand, 328, 334 
Simultaneous multithreading 
(SMT), 648-651 
Single error-correcting (SEC) 
code, 173 
Single-instruction multiple-data 
(SIMD), 378-381, 551-552, 
630-632 
Intel x86 instructions, 378-381 
pipelining instructions, 551-552 
stream, 630-632 
Single instruction, single data 
(SISD) stream, 630-632 
Single large expensive disk 
(SLEP), 195 
Single-sided disk, 189 
Single-system image, 657 
Skip instructions, 371 
Small Computer System Interface 
(SCSI), 246 
Small-scale integration (SSI), 30-31 
SMP, see Symmetric multiprocessors 
(SMP) 
Snoop control unit (SCU), 700, 
703-704 
Snoopy protocols, cache 
coherence, 642 
Soft error, 169 
Software, 26, 67-68, 640-641, 
690-694 
cache coherence solutions, 640-641 
database scaling applications, 
691-692 


development of, 26 
multicore computer performance, 
690-694 
system components, 67-68 
Valve game threading, 693-694 
Software Development Board 
(SBD), 614-624 
Software poll technique, I/O, 232 
Solid-state devices, 25 
Spatial locality, 153 
Special interest group (SIG), PCI, 95 
Special mask interrupt mode, 234 
Speed metric measures, 55 
Speedup factor, 450-451, 690-691 
Split cache memory, 140 
Stacks, 354, 373-374, 390-395, 402- 
403, 407-408, 619 
addressing mode, 402-403, 407-408 
central processing unit (CPU) 
instructions, 354, 390-395 
expression evaluation, 393-395 
frames, 374 
operations, 390-391 
pointer (SP), 392 
procedure call implementation, 
373-374 
processor implementation, 391-393 
TI 8800 SBD microsequencer, 619 
zero-address instructions, 354 
State diagrams, instruction cycles, 73, 
80, 442 
State of a process, 272-274 
Static random-access memory 
(SRAM), 162 
Status flags, 378 
Status registers, 435, 437-439 
Status signals, I/O, 220 
Stored-program concept, 18 
Striped disk performance (RAID 
level 0), 196-200 
Subactions, Fire Wire, 249 
Subnets, InfiniBand, 251 
Subtraction, 314-317, 334-337 
floating-point numbers, 334-337 
twos complement integers, 314-317 
Superscalar processors, 304, 483, 
522-557 
Advanced RISC Machine (ARM) 
Cortex-A8, 544-552 
branch prediction, 536 
CISC and RISC systems 
compared to, 483 
committing (retiring) instructions, 
536-537 
design issues, 528-534 
development of, 523-524 
execution of programs, 536-537 
implementation of programs, 537 
in-order completion, 530 
instruction issue policy, 529-533 
instruction-level parallelism and, 
304, 522-557 
Intel Pentium 4, 538-544 
out-of-order completion, 530-533 
parallelism limitations, 526-528, 
530-533 
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register renaming, 533-534 
superpipelined approach 
compared to, 525-526 
Swapping, I/O memory 
management, 277-278 
Switch, 251, 647 
Symmetric multiprocessors (SMP), 
627, 629-640, 659-660 
clusters compared to, 659-660 
IBM z990 mainframes, 637-640 
mainframe, 637-640 
multiprocessor OS design 
considerations, 636-637 
organization, 634-636 
parallel processor architecture, 
629-631 
switched interconnections, 639 
system characteristics, 632-634 
two-level shared caches, 639-640 
Synchronous DRAM (SDRAM), 
174-177 
Synchronous timing, 91-92 
Syndrome words, 170-171 
System control element (SCE), 
638-639 
System control operations, 364, 369 
System interconnection (bus), 14-15, 
84, 434-435 
System Performance Evaluation 
Corporation (SPEC), 54-56 


Tags, cache memory, 120 
Target channel adapter (TCA), 251 
Temporal locality, 153 
Test instructions, 353 
Texas Instruments, see TI 8800 
Software Development Board 
(SBD) 
Thermal control units, 696-697 
Thrashing, 128, 283 
Thread, 647 
Threading, multicore computers, 
693-694 
Thumb instruction set, ARM, 
425-426 
TI 8800 Software Development 
Board (SBD), 614-624 
arithmetic and logic unit (ALU), 
621-624 
control fields, 619-620 
microinstruction format, 615-617 
microprogrammed control units, 
614-624 
microsequencer, 617-620 
registers, 619 
stacks, 619 
Time multiplexing, 90 
Time-sharing operating systems 
(OS), 270-271 
Timing 91-93, 108-109, 222, 578-581 
asynchronous, 92-93 
bus interconnection, 91-93 
diagrams, 108-109 
Intel 8085 processor, 578-581 
1/O modules, 222 
synchronous, 91-92 
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Top-level computer structure, 14-15, 
63-109 
execute cycle, 22, 69-74 
fetch cycle, 21, 69-74 
functions, 10-13, 65-83 
instruction cycle, 21-24, 69-74, 76-83 
interconnections, 14-15, 83-104 
timing diagrams, 108-109 
Trace cache fetch, Pentium 4 
processor, 540, 542 
Trace cache net instruction pointer, 
Pentium 4 processor, 540-542 
Tracks, magnetic disks, 185, 190-191 
Trailing edge, 108-109 
Transaction layer, 248 
Transducer, I/O, 221 
Transfer of control operations, 364, 
369-374 
Transfer rate, 113-114 
Transfer time, magnetic disks, 192-193 
Transistors, development of, 25-28 
Translation lookaside buffer (TLB), 
285-287, 293-294 
Transport layer, 253 
True data (flow) dependency, 
parallelism, 526-528 
Two-level cache memory, 151-157, 
639-640 
Twos complement, 308-310, 312-327 
arithmetic, 312-317 
division restoring algorithm, 
325-327 
geometric depiction of, 316 
multiplication, 319-324 
operation, 313 
representation, 308-310 
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Vacuum tubes, development of, 17-25 
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